CacheLib
Cache performance is low when num items is high
Hi, I am using CacheLib to store all items that are requested but missing from the underlying layer, so the values of these items are empty. As a result, for 40GB of data we end up with around 2e7 items in the cache, which corresponds to a bucket power of 26. I measured the cache hit and cache miss latency. Compared with another cache, also built on CacheLib, that stores regular items, this cache's performance really suffers.
Cache type | Cache hit latency | Cache miss latency | items in cache | cache hit ratio |
---|---|---|---|---|
Cache storing regular items | 2us | 1us | 2^22 | 7% |
Cache storing empty values | 5us | 4us | 2^25 | 44% |
BTW, I set the cache size large enough so no evictions happen. The increased cache latency actually dilutes the value of using the cache. Can you give some guidance on this?
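For reference, the negative entries are written roughly like this (a simplified sketch, not the exact code; the LruAllocator alias and helper names are just illustrative):

```cpp
#include <folly/Range.h>
#include "cachelib/allocator/CacheAllocator.h"

// Simplified sketch of the negative-cache write/read path (illustrative only).
using Cache = facebook::cachelib::LruAllocator;

// Remember that `key` is missing from the underlying layer: store it with an
// empty (0-byte) value, so each item is basically key + per-item overhead.
void markMissing(Cache& cache, facebook::cachelib::PoolId pid, folly::StringPiece key) {
  if (auto handle = cache.allocate(pid, key, 0)) {
    cache.insertOrReplace(handle);
  }
}

// A hit here means "we already know the underlying layer does not have it".
bool isKnownMissing(Cache& cache, folly::StringPiece key) {
  return static_cast<bool>(cache.find(key));
}
```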
Huh can you share your full cache config? (Dump the serialized version of CacheAllocatorConfig).
Other questions:
- Are you sending similar load to both caches? (cache operations per sec are similar?)
- Does 7% mean 93% are misses in cachelib, or are these misses due to empty values?
- Are all empty values of the same size (key size + value size always identical)? What about regular items (what's their size distribution)?
Btw for negative cache, CacheLib has this thing called compact-cache that is much more space-efficient than regular cache.
https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
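Roughly, a key-only compact cache for this use case looks like the sketch below (written from memory, so double-check the exact creation and set/get signatures against the guide; the key type and names are illustrative):

```cpp
// Sketch of a key-only ("NoValue") compact cache for negative caching.
// Headers live under cachelib/compact_cache/; the creation calls below are
// from memory, so verify them against the compact_cache guide.
struct NegativeKey {
  uint64_t id;
  bool operator==(const NegativeKey& other) const { return id == other.id; }
  // Compact cache keys also need an isEmpty() sentinel method.
  bool isEmpty() const { return id == 0; }
};

using NegativeCCache = typename facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, NegativeKey>::type;

// config.enableCompactCache();                                  // before constructing the cache
// auto* ccache = cache->addCompactCache<NegativeCCache>("neg", ccacheSizeBytes);
// ccache->set(NegativeKey{id});                                 // remember a known-missing id
// bool missing = ccache->get(NegativeKey{id}) == facebook::cachelib::CCacheReturn::FOUND;
```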
(1) About the cache config: I only set the cache size and config.setAccessConfig(numCacheItems). All others are default. No hybrid cache.
(2) I run the benchmark making the same query against the same data source. The benchmark sends the workload as fast as possible.
(3) My bad about the hit ratio. Actually, 7% is for the cache storing regular items. The hit ratio for the cache storing empty values is 44%.
(4) All empty values are empty strings "". Key sizes are always identical. For regular items, the average value size is 100B and the key size is 16B.
For reference, with another query, the empty-value cache can have very good performance:
Cache type | Cache hit latency | Cache miss latency | items in cache | cache hit ratio |
---|---|---|---|---|
Cache storing empty values (another query) | 1us | 1us | 2^20 | 85% |
Btw for negative cache, CacheLib has this thing called compact-cache that is much more space-efficient than regular cache.
https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
Let me try this.
I am trying to use the ccache, but I got a compilation error:

Here is a snippet of my code:
struct MyKey {
  uint32_t key;

  MyKey(uint32_t from) {
    key = from;
  }

  bool operator==(const MyKey& other) const {
    return key == other.key;
  }

  bool isEmpty() const {
    return key == 0;
  }
};
....
bool getKey() {
  MyKey mykey = 12;
  auto res = cCache_->get(mykey); // <----- Here is the line where the compiler error is
  return res == facebook::cachelib::CCacheReturn::FOUND;
}
Any hints on this?
The compilation error only occurs when I declare the ccache as a NoValue type, i.e. CCacheCreator<CCacheAllocator, MyKey>::type. Once I specify the value type, it compiles.
@wenhaocs: huh the compiler errors look like u hit some sort of internal bug in the compiler? (On our end, we have internal use-cases specifying NoType so it should work). What happens if you duplicate the "NoValue" definition and pass in your own "no value" type?
Regarding the slower than expected lookup latency, it may be because all the keys are in the same allocation class. So each lookup (that promotes an item) will hit the same LRU lock. To validate this theory, in your test, can you try sharding your key space into 10 different cache pools?
Something like:
auto poolId = static_cast<PoolId>(hash_func(key) % 10);
auto item = cache->allocate(poolId, ...);
// ... insert into cache.
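Or a bit more concretely (pool names, sizes, and the hash function are placeholders; assumes std::string keys and the usual folly hash helpers):

```cpp
// Sketch: split the cache into 10 pools and pick one by hashing the key, so
// LRU promotions spread over 10 LRU locks instead of contending on one.
// Needs <folly/hash/Hash.h>, <string>, <vector>.
std::vector<facebook::cachelib::PoolId> pools;
const size_t perPoolBytes = totalCacheBytes / 10;  // totalCacheBytes: whatever you configured
for (int i = 0; i < 10; ++i) {
  pools.push_back(cache->addPool("shard_" + std::to_string(i), perPoolBytes));
}

// write path
const auto poolId = pools[folly::hash::fnv64(key) % pools.size()];
if (auto handle = cache->allocate(poolId, key, valueSize)) {
  cache->insertOrReplace(handle);
}
```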
Hi @therealgymmy Here are a few findings and questions:

- The floating point exception is caused by char[0] (let alone the Pedantic Warning from the compiler). I copied the NoValue definition into my code; after I changed it to char[1], the compilation succeeds. My compilers are: gcc (Ubuntu 9.4.0-1ubuntu1-20.04.1) 9.4.0, g++ (Ubuntu 9.4.0-1ubuntu1-20.04.1) 9.4.0.
- If I copy the NoValue definition into my code, how can I fake CacheLib into letting me pass a nullptr value when calling set? The CacheLib code determines this internally: constexpr static bool kHasValues = !std::is_same<typename ValueDescriptor::Value, NoValue>::value;
- After I changed to using CCache to store negative keys, the performance is much better: the cache hit and miss latencies are both sub-microsecond. But the overhead is still high if I choose to shard my key space into 10 different cache pools, around 4us for a cache hit and 3us for a cache miss. A little improvement over the original case though.
- Seems GlobalCacheStats does not contain the stats of CCache.

I know CCache is to be deprecated. Any guidance on my current situation?
- The floating point exception is caused by char[0] (let alone the Pedantic Warning from compiler).
Huh. I guess technically char[0] is illegal. But I expected the compiler to handle it (giving us a zero-sized structure). We don't see this internally, but we switched to clang a while back.
Code: https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCacheDescriptor.h#L177 Found this stackoverflow post about zero-sized struct: https://stackoverflow.com/a/47352963
@wenhaocs: for now, you can work around this by using a value structure of 1 byte. U can just choose not to populate it.
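Something along these lines (a sketch only; check CCache.h for whether set takes the value by pointer or by reference):

```cpp
// Sketch of the workaround: a 1-byte value type that is never read, instead of
// the zero-sized NoValue struct that trips gcc.
struct OneBytePlaceholder {
  char unused[1];
};

using NegativeCCache = typename facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, MyKey, OneBytePlaceholder>::type;

// OneBytePlaceholder v{};
// ccache->set(MyKey{12}, &v);   // the byte is never looked at on the read side
// bool found = ccache->get(MyKey{12}) == facebook::cachelib::CCacheReturn::FOUND;
```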
- But the overhead is still high if I choose to shard my key space into 10 different cache pools, around 4us for cache hit, 3us for cache miss.
This is a bit odd. Do you have any instrumentation on your end that can tell u which code path is taking CPU cycles? Also maybe use perftools to see if we're incurring more cpu cache misses?
My original suggestion for sharding the pool is assuming we have contention on LRU lock. But perhaps it's elsewhere.
What if you size your hashtable bigger even for the workload with fewer items? E.g. change both workloads to use a very large hashtable, say bucket power 30 (2^30 buckets). (Keep the pools sharded too, to rule out the possibility of LRU contention.) If both workloads are now at similar latency, it's probably due to CPU cache misses from a larger hash table (buckets are more spread out).
You can manually do this: https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/ChainedHashTable.h#L234
config.setAccessConfig({30 /* bucketPower */, 20 /* lockPower */});
- Seems GlobalCacheStats does not contain the stats of CCache.
Yeah that's right. CCache is intended to be used orthogonal to regular cachelib. You can get the stats directly from each CCache instance.
https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCache.h#L384
I know CCache is to be deprecated. Any guidance on my current situation?
It's in "maintenance mode" indefinitely. We have no plan to remove it but also no plan on adding additional features. We still have a few odd usage of this here and there internally.
Hi @therealgymmy Thanks for the reply.
Regarding the bucket power influence: using the benchmark with limited QPS, I do see degraded performance when we set it either too high or too low. For example, in a test with 2^21 GETs per minute and 2^17 cached items, setting the bucket power to 15 or 30 gives similar performance, inferior to 25. But that is not as big a difference as the one from using ccache.
I do not have perf tools on hand; I may need help from other teams. Here are some other findings:
(1) The put latency of ccache is high compared with regular cache. In one experiment with a fixed-QPS benchmark, the average cache put latency is 88us for ccache and 51us for the regular cache storing negative items. I know I can bypass it by writing to the cache asynchronously, but that may decrease the cache hit ratio when the QPS is extremely high and may not yield the expected results. Of course, when the QPS is low, it works perfectly. Is this higher put latency in ccache expected?
(2) Using the regular cache for negative items works badly when I run the benchmark as fast as possible, even worse than not using a cache. However, if I limit the QPS, the performance is good. For example, for the query that I mentioned earlier as having bad performance, here is a summary of the p95 latency:
w/o cache | regular cache with 10 pools | regular cache with 1 pool | ccache |
---|---|---|---|
3575ms | 1855ms | 1811ms | 1699ms |
That means if I limit the QPS, not running as fast as possible, the regular cache for negative items is not far from ccache. When running a stress test, there is a big difference between the regular cache and ccache for negative items. Could this be related to differences in how ccache and the regular cache handle multi-threading?
Do you have any suggestions on the choice of cache for negative items, where storing keys alone is enough? If using the regular cache, what's the guidance on creating pools?
(1) The put latency of ccache is high compared with regular cache
Yeah that's right. CCache put involves us re-writing the entire bucket (which contains several items) when evictions occur. You can tune this by changing the bucket size, but it is expected to be more expensive than evicting a regular item.
Compact cache's read performance is good only for small items because it performs copy-on-read. Object size should typically be less than a cache line (128 bytes).
(2) Using the regular cache for negative items works badly when I run benchmark as fast as possible, even worse than not using cache.
I do not really understand why this is. Let's debug this some more.
In cachebench tests, when I run a miss-only workload, I typically see higher throughput than with workloads that have a high hit rate. Cachebench runs its operations synchronously, so higher throughput typically also means lower latency.
You can try it out by running cachebench with: https://github.com/facebook/CacheLib/blob/main/cachelib/cachebench/test_configs/throughput/miss-workload/ram_cache_get_throughput_all_misses.json
In general, we recommend compact cache for negative caching because it is much more space-efficient. This assumes that your negative cache needs to hold a much larger working set than the non-negative cache (e.g. there are more "does-not-exist" keys in your system than "exists" keys). Does this pattern sound like your usage?
Exactly, in our benchmark, "not-exists" can be 8x more than "exists".
@wenhaocs: Did you resolve this issue?
Yes. It can be closed for now. I will dig into this further in the future.