Consider switching to TinyUFO
See https://crates.io/crates/TinyUFO. I think we would see a good perf boost from this
Actually, the policy they use is called W-TinyLFU, and it's documented on this wiki as a future feature. Their implementation has a few gaps that ideally would be addressed:
- It does not adaptively adjust the region sizes or use jitter to avoid getting stuck by a flawed admission decision. These improve the hit rate across a variety of workloads by adjusting to more optimal behavior rather than using a static one. It would be nice if both libraries could take advantage of those design improvements.
- It appears from a casual read of the code that an update that grows or shrinks the entry concurrently with it being moved across queues could lead to inconsistency in the region sizes. That's a tricky race that is easy to overlook.
- A fully lock-free design may increase the complexity of handling races, be brittle to evolve, and may not outperform a lock-based one. For example, the cache read can be lock-free by using ring buffers to record the policy updates, while applying the policy maintenance under a lock. In either case, eviction does not offer much parallelization as it serializes on the queues. Thus one can get the performance benefits of a lock-free design while retaining the more advanced functionality of a lock-based one (adaptive regions, O(1) expiration, etc.). For example, Java's Caffeine scored 940M reads/s and 585M ops/s mixed using 16 threads. The policy is less important (as proven first using LRU); it is the concurrency design that matters. The Pingora results are also excellent; it is just unclear whether they've limited the cache's flexibility and made it more difficult to resolve bugs by taking too brittle a concurrency design approach. Or, it might be a good shortcut to their performance goals if they don't want to include more advanced functionality (so, as usual, it all depends).
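The ring-buffer idea described above can be sketched in std-only Rust. This is a minimal, single-buffer illustration of the concept (lock-free, lossy recording on the read path; policy maintenance drained under one lock), not moka's or Caffeine's actual internals; real implementations use striped buffers and per-slot sequence numbers to close the claim-then-write race noted in the comments.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;

const BUF_SIZE: usize = 16;

/// A lossy ring buffer: readers publish key hashes without blocking;
/// when the buffer is full the event is simply dropped, because the
/// eviction policy only needs a *sample* of accesses to work well.
struct ReadBuffer {
    slots: [AtomicU64; BUF_SIZE],
    tail: AtomicUsize,       // claimed by readers via CAS
    head: AtomicUsize,       // advanced only by the maintainer
    policy: Mutex<Vec<u64>>, // stand-in for the eviction policy state
}

impl ReadBuffer {
    fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            tail: AtomicUsize::new(0),
            head: AtomicUsize::new(0),
            policy: Mutex::new(Vec::new()),
        }
    }

    /// Read path: never blocks. Returns false if the event was dropped.
    fn record_read(&self, key_hash: u64) -> bool {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);
            if tail.wrapping_sub(head) >= BUF_SIZE {
                return false; // full: drop the event instead of blocking
            }
            if self
                .tail
                .compare_exchange_weak(
                    tail,
                    tail.wrapping_add(1),
                    Ordering::AcqRel,
                    Ordering::Acquire,
                )
                .is_ok()
            {
                // NOTE: a slot claimed here could in theory be drained
                // before this store lands; real designs use per-slot
                // sequence numbers to detect that. Fine for a sketch.
                self.slots[tail % BUF_SIZE].store(key_hash, Ordering::Release);
                return true;
            }
        }
    }

    /// Maintenance: drain the buffer and apply policy updates under one
    /// lock. Eviction serializes on the queues anyway, so this lock does
    /// not cost parallelism on the hot read path.
    fn run_maintenance(&self) {
        let mut policy = self.policy.lock().unwrap();
        let tail = self.tail.load(Ordering::Acquire);
        let mut head = self.head.load(Ordering::Acquire);
        while head != tail {
            let hash = self.slots[head % BUF_SIZE].load(Ordering::Acquire);
            policy.push(hash); // e.g. bump the entry's LFU frequency
            head = head.wrapping_add(1);
        }
        self.head.store(head, Ordering::Release);
    }
}
```

The key property is that a full buffer sheds load (drops events) rather than stalling readers, which is what lets the read path stay lock-free while the policy itself remains an ordinary lock-protected structure.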
Hi. Thank you for the information. I reviewed the design of TinyUFO and S3-FIFO. I also compared the performance of a TinyUFO implementation ([email protected]) against [email protected] using my `mokabench` tool with the ARC-S3 trace dataset.
Here is my first thought:

- We can learn from its design and could apply some of its ideas to `moka`.
- However, we do not have to rush, as the performance gain might be small and we have very limited development bandwidth.
Before looking at the `mokabench` result, let me clarify the current development priorities of `moka` (ordered from higher to lower priority):

1. Provide a high cache hit rate.
2. Easy to use.
   - e.g. No lock is needed in the user code.
3. Convenient features.
   - e.g. Expirations, atomic insert (`get_with`, `and_compute_with`, etc.), eviction listener, iterator, cache statistics.
4. Small memory footprint.
5. Short compile time and small binary size.
6. Small performance overhead compared to a concurrent hash table.
As for 1 (hit rate), I believe TinyUFO will provide a hit rate similar to W-TinyLFU with a fixed size of W. A high hit rate is critical for application performance, as the cache miss penalty (the extra latency to get the data from slower media) is much greater than the latency to read the data from the cache.
From the `mokabench` result I am going to show, the average latency per cache operation (read or write) was the following:

- `moka`: between ~700 ns and ~850 ns (nanoseconds).
- `TinyUFO`: between ~400 ns and ~700 ns.
As for examples of cache miss penalties, here is a quote from the book *Systems Performance: Enterprise and the Cloud*, 2nd Edition, by Brendan Gregg:
The "latency" column shows example system latencies, and the "scaled" column shows the latencies scaled to an imaginary system in which a CPU cycle takes one full second.
- Using the same scale, a 700 ns cache operation in real life takes 42 minutes.
- SSD: 9 — 90 hours
- HDD: 1 — 12 months
- Internet: San Francisco to New York: 4 years
As you can see, 42 minutes is almost nothing compared to the latencies of accessing HDD or Internet.
So, in general, 1 (hit rate) is much more important than 6 (small performance overhead). And by design, TinyUFO and W-TinyLFU (with a fixed size of W) will provide comparable hit rates.
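To put rough numbers on that claim: the expected end-to-end latency per read is `hit_rate × hit_latency + miss_rate × miss_penalty`. The sketch below plugs in the 800,000-capacity rows from the benchmark table later in this comment, with an assumed ~100 µs miss penalty (a fast SSD read); the penalty figure is an assumption for illustration only.

```rust
/// Expected end-to-end latency per cache read, in nanoseconds:
///   E[latency] = hit_rate * hit_latency + (1 - hit_rate) * miss_penalty
///
/// Plugging in the 800,000-capacity rows from the benchmark table, with
/// an assumed ~100 us miss penalty (a fast SSD read):
///   moka:    0.70331 * 716 ns + 0.29669 * 100_000 ns ~= 30.2 us
///   TinyUFO: 0.65885 * 555 ns + 0.34115 * 100_000 ns ~= 34.5 us
/// The 161 ns slower cache operation is dwarfed by the ~4.3 us saved
/// per operation through the higher hit rate.
fn expected_ns(hit_rate: f64, hit_ns: f64, miss_penalty_ns: f64) -> f64 {
    hit_rate * hit_ns + (1.0 - hit_rate) * miss_penalty_ns
}
```

With a slower backing store (HDD, network), the gap from the hit-rate difference only grows.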
As for 4 (memory footprint), TinyUFO can do better than W-TinyLFU, as they use queues and doubly linked lists respectively. A queue can be array-based and have a smaller memory footprint than a doubly linked list. However, a queue is not a suitable data structure for an efficient entry-expiration algorithm, so `moka` may have to continue using doubly linked lists to support expirations. This is a trade-off between 4 and 3 (convenient features).
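To illustrate why expiration pushes toward doubly linked lists: an expired (or explicitly invalidated) entry must be unlinked from the *middle* of the order, which an array-backed queue cannot do in O(1). Below is a minimal slab-backed doubly linked list sketch; it is illustrative only, not moka's actual implementation. Note the memory cost mentioned above (two extra indices per node), and that a real implementation would also recycle unlinked slots via a free list.

```rust
#[derive(Clone, Copy)]
struct Node {
    key: u64,
    prev: Option<usize>,
    next: Option<usize>,
}

/// A doubly linked list whose nodes live in a Vec "slab". Handles
/// (indices) allow O(1) unlink of any entry, which is what expiration
/// needs. An array-backed queue would have to scan or leave tombstones.
struct DList {
    nodes: Vec<Node>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl DList {
    fn new() -> Self {
        Self { nodes: Vec::new(), head: None, tail: None }
    }

    /// Push at the tail; returns a handle usable for O(1) removal.
    fn push_back(&mut self, key: u64) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { key, prev: self.tail, next: None });
        match self.tail {
            Some(t) => self.nodes[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
        idx
    }

    /// O(1) unlink of an arbitrary entry, e.g. when it expires.
    /// (A real implementation would recycle the slot via a free list.)
    fn unlink(&mut self, idx: usize) {
        let Node { prev, next, .. } = self.nodes[idx];
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
        self.nodes[idx].prev = None;
        self.nodes[idx].next = None;
    }

    /// Walk the list front to back (eviction/expiration order).
    fn keys_in_order(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(self.nodes[i].key);
            cur = self.nodes[i].next;
        }
        out
    }
}
```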
There is a trade-off between 3 and 6 (small performance overhead) too. So `moka`, which prioritizes 3, will generally perform less well than other cache implementations that offer simpler functionality. But this will not be a problem in real-world applications, as 1 (hit rate) has a much greater impact on application performance than 6.

OK. Having said that, here is the `mokabench` result:
- Command: `cargo run -F tiny-ufo --release -- -n 1`
  - The number of clients (`-n`) was 1 (single thread).
- Ran on a Linux x86_64 PC with an Intel Core i7-12700F.
- The memory usage (RSS) was taken using the `top` command.
  - This is the whole memory used by the `mokabench` process.
| Cache | Max Capacity | Estimated number of keys for LFU | Clients | Inserts | Reads | Hit Ratio | Memory Usage (RSS) | Duration Secs | Avg. nanosecs per operation |
|---|---|---|---|---|---|---|---|---|---|
| Moka Sync Cache | 100,000 | n/a | 1 | 14,694,850 | 16,407,702 | 10.439% | 572,852k | 21.595 s | 694 ns |
| TinyUFO | 100,000 | 100,000 | 1 | 15,685,418 | 16,407,702 | 4.402% | 570,560k | 13.077 s | 407 ns |
| TinyUFO | 100,000 | 1,000,000 | 1 | 15,074,058 | 16,407,702 | 8.128% | 633,272k | 21.706 s | 689 ns |
| Moka Sync Cache | 400,000 | n/a | 1 | 9,439,274 | 16,407,702 | 42.470% | 725,132k | 21.926 s | 848 ns |
| TinyUFO | 400,000 | 400,000 | 1 | 11,456,282 | 16,407,702 | 30.177% | 723,656k | 17.174 s | 616 ns |
| TinyUFO | 400,000 | 4,000,000 | 1 | 10,948,476 | 16,407,702 | 33.272% | 994,580k | 19.670 s | 719 ns |
| Moka Sync Cache | 800,000 | n/a | 1 | 4,868,066 | 16,407,702 | 70.331% | 927,520k | 15.246 s | 716 ns |
| TinyUFO | 800,000 | 800,000 | 1 | 5,597,449 | 16,407,702 | 65.885% | 929,512k | 12.203 s | 555 ns |
| TinyUFO | 800,000 | 8,000,000 | 1 | 5,597,708 | 16,407,702 | 65.884% | 1.4g | 13.530 s | 614 ns |
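As a sanity check on the last column: it is simply the wall-clock duration divided by the total number of operations (inserts + reads). For example, for the first row, 21.595 s / (14,694,850 + 16,407,702) ops ≈ 694 ns/op.

```rust
/// Average nanoseconds per cache operation: total duration divided by
/// the total operation count (inserts + reads), rounded to a whole ns.
fn avg_ns_per_op(duration_secs: f64, inserts: u64, reads: u64) -> u64 {
    (duration_secs * 1e9 / (inserts + reads) as f64).round() as u64
}
```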
- The `TinyUFO` implementation was up to twice as fast as `moka`.
- For some reason, `TinyUFO` provided a lower hit ratio than `moka`.
  - `TinyUFO` is expected to provide about the same hit ratio as `moka`.
- For some reason, `TinyUFO` used about the same amount of memory as `moka`.
  - `TinyUFO` is expected to use a smaller amount of memory than `moka`.
  - I will probably need to subtract the baseline memory used by `mokabench`'s driver.
    - https://github.com/moka-rs/mokabench/pull/6: fully pre-process the benchmark commands before performing the benchmark.

It seems the current `TinyUFO` implementation can be improved for a better hit ratio and a smaller memory footprint. But even if it is improved, I am afraid that the performance gain might be small.