CacheLib
Cache performance is low when num items is high
Hi, I am using CacheLib to store all items that are requested but missing from the underlying layer, so the values of these items are empty. As a result, for 40GB of data we end up with around 2e7 items in the cache, which corresponds to a bucket power of 26. I measured the cache hit and cache miss latency. Compared with another cache, also built on CacheLib, that stores regular items, this cache's performance really suffers.
Cache type | Cache hit latency | Cache miss latency | items in cache | cache hit ratio |
---|---|---|---|---|
Cache storing regular items | 2us | 1us | 2^22 | 7% |
Cache storing empty values | 5us | 4us | 2^25 | 44% |
BTW, I set the cache size large enough so no evictions happen. The increased cache latency actually dilutes the value of using the cache. Can you give some guidance on this?
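For reference, the negative entries are written roughly like this (a simplified sketch, not the exact code; the LruAllocator alias and helper names are just illustrative):

```cpp
#include <folly/Range.h>
#include "cachelib/allocator/CacheAllocator.h"

// Simplified sketch of the negative-cache write/read path (illustrative only).
using Cache = facebook::cachelib::LruAllocator;

// Remember that `key` is missing from the underlying layer: store it with an
// empty (0-byte) value, so each item is basically key + per-item overhead.
void markMissing(Cache& cache, facebook::cachelib::PoolId pid, folly::StringPiece key) {
  if (auto handle = cache.allocate(pid, key, 0)) {
    cache.insertOrReplace(handle);
  }
}

// A hit here means "we already know the underlying layer does not have it".
bool isKnownMissing(Cache& cache, folly::StringPiece key) {
  return static_cast<bool>(cache.find(key));
}
```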
Huh can you share your full cache config? (Dump the serialized version of CacheAllocatorConfig).
Other questions:
- Are you sending similar load to both caches? (cache operations per sec are similar?)
- Does 7% mean 93% are misses in cachelib, or are these misses due to empty values?
- Are all empty values of the same size (key size + value size always identical)? What about regular items (what's their size distribution)?
Btw for negative cache, CacheLib has this thing called compact-cache that is much more space-efficient than regular cache.
https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
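Roughly, a key-only compact cache for this use case looks like the sketch below (written from memory, so double-check the exact creation and set/get signatures against the guide; the key type and names are illustrative):

```cpp
// Sketch of a key-only ("NoValue") compact cache for negative caching.
// Headers live under cachelib/compact_cache/; the creation calls below are
// from memory, so verify them against the compact_cache guide.
struct NegativeKey {
  uint64_t id;
  bool operator==(const NegativeKey& other) const { return id == other.id; }
  // Compact cache keys also need an isEmpty() sentinel method.
  bool isEmpty() const { return id == 0; }
};

using NegativeCCache = typename facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, NegativeKey>::type;

// config.enableCompactCache();                                  // before constructing the cache
// auto* ccache = cache->addCompactCache<NegativeCCache>("neg", ccacheSizeBytes);
// ccache->set(NegativeKey{id});                                 // remember a known-missing id
// bool missing = ccache->get(NegativeKey{id}) == facebook::cachelib::CCacheReturn::FOUND;
```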
(1) About the cache config: I only set the cache size and config.setAccessConfig(numCacheItems). All others are default. No hybrid cache.
(2) I run the benchmark making the same query against the same data source. The benchmark sends the workload as fast as possible.
(3) My bad about the hit ratio. Actually, 7% is for the cache storing regular items. The hit ratio for the cache storing empty values is 44%.
(4) All empty values are empty strings "". Key sizes are always identical. For regular items, the average value size is 100B and the key size is 16B.
For reference, with another query, the empty-value cache can have very good performance:
Cache type | Cache hit latency | Cache miss latency | items in cache | cache hit ratio |
---|---|---|---|---|
Cache storing empty values (another query) | 1us | 1us | 2^20 | 85% |
Btw for negative cache, CacheLib has this thing called compact-cache that is much more space-efficient than regular cache.
https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
Let me try this.
I am trying to use the ccache, but I got a compilation error:

Here is a snippet of my code:
struct MyKey {
  uint32_t key;

  MyKey(uint32_t from) {
    key = from;
  }

  bool operator==(const MyKey& other) const {
    return key == other.key;
  }

  bool isEmpty() const {
    return key == 0;
  }
};
....
bool getKey() {
  MyKey mykey = 12;
  auto res = cCache_->get(mykey); // <----- Here is the line where the compiler error is
  return res == facebook::cachelib::CCacheReturn::FOUND;
}
Any hints on this?
The compilation error only occurs when I declare the ccache as a NoValue type, i.e. CCacheCreator<CCacheAllocator, MyKey>::type. Once I specify the value type, it compiles.
@wenhaocs: huh the compiler errors look like u hit some sort of internal bug in the compiler? (On our end, we have internal use-cases specifying NoType so it should work). What happens if you duplicate the "NoValue" definition and pass in your own "no value" type?
Regarding the slower than expected lookup latency, it may be because all the keys are in the same allocation class. So each lookup (that promotes an item) will hit the same LRU lock. To validate this theory, in your test, can you try sharding your key space into 10 different cache pools?
Something like:
auto poolId = static_cast<PoolId>(hash_func(key) % 10);
auto item = cache->allocate(poolId, ...);
// ... insert into cache.
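Or a bit more concretely (pool names, sizes, and the hash function are placeholders; assumes std::string keys and the usual folly hash helpers):

```cpp
// Sketch: split the cache into 10 pools and pick one by hashing the key, so
// LRU promotions spread over 10 LRU locks instead of contending on one.
// Needs <folly/hash/Hash.h>, <string>, <vector>.
std::vector<facebook::cachelib::PoolId> pools;
const size_t perPoolBytes = totalCacheBytes / 10;  // totalCacheBytes: whatever you configured
for (int i = 0; i < 10; ++i) {
  pools.push_back(cache->addPool("shard_" + std::to_string(i), perPoolBytes));
}

// write path
const auto poolId = pools[folly::hash::fnv64(key) % pools.size()];
if (auto handle = cache->allocate(poolId, key, valueSize)) {
  cache->insertOrReplace(handle);
}
```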
Hi @therealgymmy Here are a few findings and questions:

- The floating point exception is caused by char[0] (let alone the Pedantic Warning from the compiler). I copied the NoValue definition into my code; after I changed it to char[1], the compilation succeeds. My compilers are: gcc (Ubuntu 9.4.0-1ubuntu1-20.04.1) 9.4.0, g++ (Ubuntu 9.4.0-1ubuntu1-20.04.1) 9.4.0.
- If I copy the NoValue definition into my code, how can I fake CacheLib into letting me pass a nullptr value when calling set? The CacheLib code determines this internally: constexpr static bool kHasValues = !std::is_same<typename ValueDescriptor::Value, NoValue>::value;
- After I changed to using CCache to store negative keys, the performance is much better: the cache hit and miss latencies are both sub-microsecond. But the overhead is still high if I choose to shard my key space into 10 different cache pools, around 4us for a cache hit and 3us for a cache miss. A little improvement over the original case though.
- Seems GlobalCacheStats does not contain the stats of CCache.

I know CCache is to be deprecated. Any guidance on my current situation?
- The floating point exception is caused by char[0] (let alone the Pedantic Warning from compiler).
Huh. I guess technically char[0] is illegal. But I expected the compiler to handle it (giving us a zero-sized structure). We don't see this internally, but we switched to clang a while back.
Code: https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCacheDescriptor.h#L177 Found this stackoverflow post about zero-sized struct: https://stackoverflow.com/a/47352963
@wenhaocs: for now, you can work around this by using a value structure of 1 byte. U can just choose not to populate it.
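Something along these lines (a sketch only; check CCache.h for whether set takes the value by pointer or by reference):

```cpp
// Sketch of the workaround: a 1-byte value type that is never read, instead of
// the zero-sized NoValue struct that trips gcc.
struct OneBytePlaceholder {
  char unused[1];
};

using NegativeCCache = typename facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, MyKey, OneBytePlaceholder>::type;

// OneBytePlaceholder v{};
// ccache->set(MyKey{12}, &v);   // the byte is never looked at on the read side
// bool found = ccache->get(MyKey{12}) == facebook::cachelib::CCacheReturn::FOUND;
```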
- But the overhead is still high if I choose to shard my key space into 10 different cache pools, around 4us for cache hit, 3us for cache miss.
This is a bit odd. Do you have any instrumentation on your end that can tell u which code path is taking CPU cycles? Also maybe use perftools to see if we're incurring more cpu cache misses?
My original suggestion for sharding the pool is assuming we have contention on LRU lock. But perhaps it's elsewhere.
What if you size your hashtable bigger even for the workload with fewer items? E.g. change both workloads to use a very large hashtable, say bucket power 30 (2^30 buckets). (Keep the pools sharded too, to rule out the possibility of LRU contention.) If both workloads are now at similar latency, it's probably due to CPU cache misses from a larger hash table (buckets are more spread out).
You can manually do this: https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/ChainedHashTable.h#L234
config.setAccessConfig({30 /* bucketPower */, 20 /* lockPower */});
- Seems GlobalCacheStats does not contain the stats of CCache.
Yeah that's right. CCache is intended to be used orthogonal to regular cachelib. You can get the stats directly from each CCache instance.
https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCache.h#L384
I know CCache is to be deprecated. Any guidance on my current situation?
It's in "maintenance mode" indefinitely. We have no plan to remove it but also no plan on adding additional features. We still have a few odd usage of this here and there internally.
Hi @therealgymmy Thanks for the reply.
Regarding the bucket power influence: using the benchmark with limited QPS, I do see degraded performance when we set it either too high or too low. For example, in a test with 2^21 GETs per minute and 2^17 cached items, setting the bucket power to 15 or 30 gives similar performance, inferior to 25. But that is not as big a difference as the one from using ccache.
I do not have perf tools on hand; I may need help from other teams. Here are some other findings:
(1) The put latency of ccache is high compared with regular cache. In one experiment with a fixed-QPS benchmark, the average cache put latency is 88us for ccache and 51us for the regular cache storing negative items. I know I can bypass it by writing to the cache asynchronously, but that may decrease the cache hit ratio when the QPS is extremely high and may not yield the expected results. Of course, when the QPS is low, it works perfectly. Is this higher put latency in ccache expected?
(2) Using the regular cache for negative items works badly when I run the benchmark as fast as possible, even worse than not using a cache. However, if I limit the QPS, the performance is good. For example, for the query that I mentioned earlier as having bad performance, here is a summary of the p95 latency:
w/o cache | regular cache with 10 pools | regular cache with 1 pool | ccache |
---|---|---|---|
3575ms | 1855ms | 1811ms | 1699ms |
That means if I limit the QPS, not running as fast as possible, the regular cache for negative items is not far from ccache. When running a stress test, there is a big difference between the regular cache and ccache for negative items. Could this be related to differences in how ccache and the regular cache handle multi-threading?
Do you have any suggestions on the choice of cache for negative items, where storing keys alone is enough? If using the regular cache, what's the guidance on creating pools?
(1) The put latency of ccache is high compared with regular cache
Yeah that's right. CCache put involves us re-writing the entire bucket (which contains several items) when evictions occur. You can tune this by changing the bucket size, but it is expected to be more expensive than evicting a regular item.
Compact cache's read performance is good only for small items because it performs copy-on-read. Object size should typically be less than a cache line (128 bytes).
(2) Using the regular cache for negative items works badly when I run benchmark as fast as possible, even worse than not using cache.
I do not really understand why this is. Let's debug this some more.
In cachebench tests, when I run a miss-only workload, I typically see higher throughput than with workloads that have a high hit rate. Cachebench runs its operations synchronously, so higher throughput typically also means lower latency.
You can try it out by running cachebench with: https://github.com/facebook/CacheLib/blob/main/cachelib/cachebench/test_configs/throughput/miss-workload/ram_cache_get_throughput_all_misses.json
In general, we recommend compact cache for negative caching because it is much more space-efficient. This assumes that your negative cache needs to hold a much larger working set than the non-negative cache (e.g. there are more "does-not-exist" keys in your system than "exists" keys). Does this pattern sound like your usage?
Exactly, in our benchmark, "not-exists" can be 8x more than "exists".
@wenhaocs: Did you resolve this issue?
Yes. It can be closed for now. I will dig into this further in the future.