Consider switching to TinyUFO
See https://crates.io/crates/TinyUFO. I think we would see a good perf boost from this
Actually, the policy they use is called W-TinyLFU, and it's documented on this wiki as a future feature. Their implementation has a few gaps that ideally would be addressed:
- It does not adaptively adjust the region sizes or use jitter to avoid getting stuck by a flawed admission decision. These improve the hit rate across a variety of workloads by adjusting to more optimal behavior rather than using a static one. It would be nice if both libraries could take advantage of those design improvements.
- It appears from a casual read of the code that an update that grows or shrinks the entry concurrently with it being moved across queues could lead to inconsistency in the region sizes. That's a tricky race that is easy to overlook.
- A fully lock-free design may increase the complexity of handling races, be brittle to evolve, and may not outperform a lock-based one. For example, the cache read can be lock-free by using ring buffers to record the policy updates, while applying the policy maintenance under a lock. In either case, eviction does not offer much parallelization as it serializes on the queues. Thus one can get the performance benefits of a lock-free design while retaining the more advanced functionality of a lock-based one (adaptive regions, O(1) expiration, etc.). For example, Java's Caffeine scored 940M reads/s and 585M ops/s mixed using 16 threads. The policy is less important (as proven first using LRU); it is the concurrency design that matters. The Pingora results are also excellent; it is just unclear whether they've limited the cache's flexibility and made it more difficult to resolve bugs by taking too brittle a concurrency design approach. Or, it might be a good shortcut to their performance goals if they don't want to include more advanced functionality (so, as usual, it all depends).
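The ring-buffer idea described above can be sketched in std-only Rust. This is a minimal, single-buffer illustration of the concept (lock-free, lossy recording on the read path; policy maintenance drained under one lock), not moka's or Caffeine's actual internals; real implementations use striped buffers and per-slot sequence numbers to close the claim-then-write race noted in the comments.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;

const BUF_SIZE: usize = 16;

/// A lossy ring buffer: readers publish key hashes without blocking;
/// when the buffer is full the event is simply dropped, because the
/// eviction policy only needs a *sample* of accesses to work well.
struct ReadBuffer {
    slots: [AtomicU64; BUF_SIZE],
    tail: AtomicUsize,       // claimed by readers via CAS
    head: AtomicUsize,       // advanced only by the maintainer
    policy: Mutex<Vec<u64>>, // stand-in for the eviction policy state
}

impl ReadBuffer {
    fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            tail: AtomicUsize::new(0),
            head: AtomicUsize::new(0),
            policy: Mutex::new(Vec::new()),
        }
    }

    /// Read path: never blocks. Returns false if the event was dropped.
    fn record_read(&self, key_hash: u64) -> bool {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);
            if tail.wrapping_sub(head) >= BUF_SIZE {
                return false; // full: drop the event instead of blocking
            }
            if self
                .tail
                .compare_exchange_weak(
                    tail,
                    tail.wrapping_add(1),
                    Ordering::AcqRel,
                    Ordering::Acquire,
                )
                .is_ok()
            {
                // NOTE: a slot claimed here could in theory be drained
                // before this store lands; real designs use per-slot
                // sequence numbers to detect that. Fine for a sketch.
                self.slots[tail % BUF_SIZE].store(key_hash, Ordering::Release);
                return true;
            }
        }
    }

    /// Maintenance: drain the buffer and apply policy updates under one
    /// lock. Eviction serializes on the queues anyway, so this lock does
    /// not cost parallelism on the hot read path.
    fn run_maintenance(&self) {
        let mut policy = self.policy.lock().unwrap();
        let tail = self.tail.load(Ordering::Acquire);
        let mut head = self.head.load(Ordering::Acquire);
        while head != tail {
            let hash = self.slots[head % BUF_SIZE].load(Ordering::Acquire);
            policy.push(hash); // e.g. bump the entry's LFU frequency
            head = head.wrapping_add(1);
        }
        self.head.store(head, Ordering::Release);
    }
}
```

The key property is that a full buffer sheds load (drops events) rather than stalling readers, which is what lets the read path stay lock-free while the policy itself remains an ordinary lock-protected structure.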
Hi. Thank you for the information. I reviewed the design of TinyUFO and S3-FIFO. I also compared the performance of a TinyUFO implementation ([email protected]) against [email protected] using my `mokabench` tool with the ARC-S3 trace dataset.
Here is my first thought:

- We can learn from its design and could apply some of its ideas to `moka`.
- However, we do not have to rush, as the performance gain might be small and we have very limited development bandwidth.
Before looking at the `mokabench` result, let me clarify the current development priorities of `moka` (ordered from higher to lower priority):

1. Provide a high cache hit rate.
2. Easy to use.
   - e.g. No lock is needed in the user code.
3. Convenient features.
   - e.g. Expirations, atomic insert (`get_with`, `and_compute_with`, etc.), eviction listener, iterator, cache statistics.
4. Small memory footprint.
5. Short compile time and small binary size.
6. Small performance overhead compared to a concurrent hash table.
As for 1 (hit rate), I believe TinyUFO will provide a hit rate similar to W-TinyLFU with a fixed size of W. A high hit rate is critical for application performance, as the cache miss penalty (the extra latency to get the data from slower media) is much greater than the latency to read the data from the cache.
From the `mokabench` result I am going to show, the average latency per cache operation (read or write) was the following:

- `moka`: between ~700 ns and ~850 ns (nanoseconds).
- `TinyUFO`: between ~400 ns and ~700 ns.
As for examples of cache miss penalties, here is a quote from the book *Systems Performance: Enterprise and the Cloud*, 2nd Edition, by Brendan Gregg:
The "latency" column shows example system latencies, and the "scaled" column shows the latencies scaled to an imaginary system in which a CPU cycle takes one full second.
- Using the same scale, a 700 ns cache operation in real life takes 42 minutes.
- SSD: 9 — 90 hours
- HDD: 1 — 12 months
- Internet: San Francisco to New York: 4 years
As you can see, 42 minutes is almost nothing compared to the latencies of accessing HDD or Internet.
So, in general, 1 (hit rate) is much more important than 6 (small performance overhead). And by design, TinyUFO and W-TinyLFU (with a fixed size of W) will provide comparable hit rates.
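To put rough numbers on that claim: the expected end-to-end latency per read is `hit_rate × hit_latency + miss_rate × miss_penalty`. The sketch below plugs in the 800,000-capacity rows from the benchmark table later in this comment, with an assumed ~100 µs miss penalty (a fast SSD read); the penalty figure is an assumption for illustration only.

```rust
/// Expected end-to-end latency per cache read, in nanoseconds:
///   E[latency] = hit_rate * hit_latency + (1 - hit_rate) * miss_penalty
///
/// Plugging in the 800,000-capacity rows from the benchmark table, with
/// an assumed ~100 us miss penalty (a fast SSD read):
///   moka:    0.70331 * 716 ns + 0.29669 * 100_000 ns ~= 30.2 us
///   TinyUFO: 0.65885 * 555 ns + 0.34115 * 100_000 ns ~= 34.5 us
/// The 161 ns slower cache operation is dwarfed by the ~4.3 us saved
/// per operation through the higher hit rate.
fn expected_ns(hit_rate: f64, hit_ns: f64, miss_penalty_ns: f64) -> f64 {
    hit_rate * hit_ns + (1.0 - hit_rate) * miss_penalty_ns
}
```

With a slower backing store (HDD, network), the gap from the hit-rate difference only grows.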
As for 4 (memory footprint), TinyUFO can do better than W-TinyLFU, as they use queues and doubly linked lists respectively. A queue can be array-based and have a smaller memory footprint than a doubly linked list. However, a queue is not a suitable data structure for an efficient entry-expiration algorithm, so `moka` may have to continue using doubly linked lists to support expirations. This is a trade-off between 4 and 3 (convenient features).
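To illustrate why expiration pushes toward doubly linked lists: an expired (or explicitly invalidated) entry must be unlinked from the *middle* of the order, which an array-backed queue cannot do in O(1). Below is a minimal slab-backed doubly linked list sketch; it is illustrative only, not moka's actual implementation. Note the memory cost mentioned above (two extra indices per node), and that a real implementation would also recycle unlinked slots via a free list.

```rust
#[derive(Clone, Copy)]
struct Node {
    key: u64,
    prev: Option<usize>,
    next: Option<usize>,
}

/// A doubly linked list whose nodes live in a Vec "slab". Handles
/// (indices) allow O(1) unlink of any entry, which is what expiration
/// needs. An array-backed queue would have to scan or leave tombstones.
struct DList {
    nodes: Vec<Node>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl DList {
    fn new() -> Self {
        Self { nodes: Vec::new(), head: None, tail: None }
    }

    /// Push at the tail; returns a handle usable for O(1) removal.
    fn push_back(&mut self, key: u64) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { key, prev: self.tail, next: None });
        match self.tail {
            Some(t) => self.nodes[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
        idx
    }

    /// O(1) unlink of an arbitrary entry, e.g. when it expires.
    /// (A real implementation would recycle the slot via a free list.)
    fn unlink(&mut self, idx: usize) {
        let Node { prev, next, .. } = self.nodes[idx];
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
        self.nodes[idx].prev = None;
        self.nodes[idx].next = None;
    }

    /// Walk the list front to back (eviction/expiration order).
    fn keys_in_order(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(self.nodes[i].key);
            cur = self.nodes[i].next;
        }
        out
    }
}
```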
There is a trade-off between 3 and 6 (small performance overhead) too. So `moka`, which prioritizes 3, will generally perform less well than other cache implementations that offer simpler functionality. But this will not be a problem in real-world applications, as 1 (hit rate) has a much greater impact on application performance than 6.

OK. Having said that, here is the `mokabench` result:
- Command: `cargo run -F tiny-ufo --release -- -n 1`
  - The number of clients (`-n`) was 1 (single thread).
- Ran on a Linux x86_64 PC with an Intel Core i7-12700F.
- The memory usage (RSS) was taken using the `top` command.
  - This is the whole memory used by the `mokabench` process.
| Cache | Max Capacity | Estimated number of keys for LFU | Clients | Inserts | Reads | Hit Ratio | Memory Usage (RSS) | Duration Secs | Avg. nanosecs per operation |
|---|---|---|---|---|---|---|---|---|---|
| Moka Sync Cache | 100,000 | n/a | 1 | 14,694,850 | 16,407,702 | 10.439% | 572,852k | 21.595 s | 694 ns |
| TinyUFO | 100,000 | 100,000 | 1 | 15,685,418 | 16,407,702 | 4.402% | 570,560k | 13.077 s | 407 ns |
| TinyUFO | 100,000 | 1,000,000 | 1 | 15,074,058 | 16,407,702 | 8.128% | 633,272k | 21.706 s | 689 ns |
| Moka Sync Cache | 400,000 | n/a | 1 | 9,439,274 | 16,407,702 | 42.470% | 725,132k | 21.926 s | 848 ns |
| TinyUFO | 400,000 | 400,000 | 1 | 11,456,282 | 16,407,702 | 30.177% | 723,656k | 17.174 s | 616 ns |
| TinyUFO | 400,000 | 4,000,000 | 1 | 10,948,476 | 16,407,702 | 33.272% | 994,580k | 19.670 s | 719 ns |
| Moka Sync Cache | 800,000 | n/a | 1 | 4,868,066 | 16,407,702 | 70.331% | 927,520k | 15.246 s | 716 ns |
| TinyUFO | 800,000 | 800,000 | 1 | 5,597,449 | 16,407,702 | 65.885% | 929,512k | 12.203 s | 555 ns |
| TinyUFO | 800,000 | 8,000,000 | 1 | 5,597,708 | 16,407,702 | 65.884% | 1.4g | 13.530 s | 614 ns |
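As a sanity check on the last column: it is simply the wall-clock duration divided by the total number of operations (inserts + reads). For example, for the first row, 21.595 s / (14,694,850 + 16,407,702) ops ≈ 694 ns/op.

```rust
/// Average nanoseconds per cache operation: total duration divided by
/// the total operation count (inserts + reads), rounded to a whole ns.
fn avg_ns_per_op(duration_secs: f64, inserts: u64, reads: u64) -> u64 {
    (duration_secs * 1e9 / (inserts + reads) as f64).round() as u64
}
```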
- The `TinyUFO` implementation was up to twice as fast as `moka`.
- For some reason, `TinyUFO` provided a lower hit ratio than `moka`.
  - `TinyUFO` is expected to provide about the same hit ratio as `moka`.
- For some reason, `TinyUFO` used about the same amount of memory as `moka`.
  - `TinyUFO` is expected to use a smaller amount of memory than `moka`.
  - I will probably need to subtract the baseline memory used by `mokabench`'s driver.
    - https://github.com/moka-rs/mokabench/pull/6: fully pre-process the benchmark commands before performing the benchmark.

It seems the current `TinyUFO` implementation can be improved for a better hit ratio and a smaller memory footprint. But even if it is improved, I am afraid that the performance gain might be small.