quick-cache
Eviction with hierarchical keys
:wave: It's the LSM-tree guy again.
I've seen that QC supports a clear() method; however, I find that pretty destructive for my use case. As is typical with an LSM-tree, parts of the database may get nuked after a compaction, so a bunch of cache entries become useless. Of course, they will get replaced over time, but only if the cache limit is actually reached eventually.
Consider a cache capacity of 1 GB and a database of size 500 MB. As data is written and compacted down the levels, the cache fills up to, say, 500 MB. Now, when a compaction creates new segments further down the tree and nukes the old ones, the cache capacity is not reduced, but the new segments' entries come on top of the stale ones, pushing the cache past 500 MB, up to 1 GB. This makes it impossible to configure the cache optimistically: if 2 GB are stored and the cache capacity is 8 GB, the cache is likely to grow to 8 GB over time, so 6 GB of it will be dead data nobody will ever read again.

This is especially noticeable in a key-value separated tree, where the index tree is pretty small (e.g. 30 MB). Setting the cache to something like 50 MB may require manually monitoring and adjusting the cache size once it grows past 50 MB, while setting it to 1 GB eats memory even though just 30-50 MB would be needed. Even a pessimistic 200 MB would use 150 MB more than needed.
The block cache uses weighted entries and, more importantly, hierarchical keys that look like this: (tree id) -> (segment id) -> (offset)
So if segment 4 in tree 0 is deleted, I would like to evict all cache entries that match (0) -> (4) -> *
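To make the semantics concrete, here's a minimal sketch of the behavior I'm after, using a plain HashMap as a stand-in for the cache (quick-cache has no such method today; everything here is illustrative only):

```rust
use std::collections::HashMap;

type TreeId = u64;
type SegmentId = u64;
type Offset = u64;
type BlockKey = (TreeId, SegmentId, Offset);

/// Drop every cached block under (tree) -> (segment) -> *, keep the rest.
/// A HashMap stands in for the cache here; this just shows the semantics
/// I'm asking for, not an existing QC API.
fn evict_segment(cache: &mut HashMap<BlockKey, Vec<u8>>, tree: TreeId, segment: SegmentId) {
    cache.retain(|&(t, s, _offset), _block| !(t == tree && s == segment));
}
```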
Could QC support a retain() method that allows partially clearing the cache? This process doesn't happen too often and happens asynchronously, but I can imagine that for a huge, GB-sized cache it could still have quite some overhead, as I understand it would have to go through all shards. Would it be advantageous to adjust the hash so that a segment's entries are partitioned onto the same shard (i.e. hashing just [tree ID, segment ID] instead of [tree ID, segment ID, offset])? That would cut the cleanup down to a single shard, but it would need some API so the retain is concentrated on just that one shard (it would need the hash...) - see the sketch below.
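To illustrate the cost argument, here's a toy sharded map (purely a model; quick-cache's real sharding is internal and may differ): a generic retain() has to walk every shard, while prefix sharding pins a whole segment to one shard, so evicting a segment touches only that shard.

```rust
use std::collections::HashMap;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

type BlockKey = (u64, u64, u64); // (tree id, segment id, offset)

/// Toy model of a sharded cache; only here to illustrate the cost argument.
struct ShardedCache {
    state: RandomState,
    shards: Vec<HashMap<BlockKey, Vec<u8>>>,
}

impl ShardedCache {
    /// Shard by the (tree, segment) prefix only, ignoring the offset,
    /// so all blocks of one segment live in a single shard.
    fn shard_of(&self, tree: u64, segment: u64) -> usize {
        let mut h = self.state.build_hasher();
        (tree, segment).hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    /// What a generic retain() must do: lock and scan every shard.
    fn retain(&mut self, mut keep: impl FnMut(&BlockKey) -> bool) {
        for shard in &mut self.shards {
            shard.retain(|k, _| keep(k));
        }
    }

    /// With prefix sharding, evicting one segment touches exactly one shard.
    fn evict_segment(&mut self, tree: u64, segment: u64) {
        let idx = self.shard_of(tree, segment);
        self.shards[idx].retain(|&(t, s, _), _| !(t == tree && s == segment));
    }
}
```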
Does this make sense?
Edit: I guess this probably doesn't work as-is, because the same hash is used for the hashbrown table... there would need to be two hashes, a partitioning/shard hash and the item hash...?
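For what it's worth, the toy above already does this implicitly (the inner HashMap hashes the full key with its own hasher). Spelled out as a sketch (both functions are hypothetical, nothing here is QC API): the shard hash covers only the (tree, segment) prefix, while the item hash covers the full key, so offsets within a segment still spread across the table's buckets.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

type BlockKey = (u64, u64, u64); // (tree id, segment id, offset)

/// Hypothetical shard hash: prefix only, so a segment is pinned to one shard.
fn shard_hash(state: &RandomState, key: &BlockKey) -> u64 {
    let mut h = state.build_hasher();
    (key.0, key.1).hash(&mut h); // ignore the offset on purpose
    h.finish()
}

/// Hypothetical item hash for the hashbrown table: full key, so blocks of
/// the same segment still spread across buckets within the shard.
fn item_hash(state: &RandomState, key: &BlockKey) -> u64 {
    let mut h = state.build_hasher();
    key.hash(&mut h);
    h.finish()
}
```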