[Storage] Memory caches using concatenated IDs may contain items deleted from databases
Updates #7242
Currently, some memory caches that use concatenated IDs (or similar composite values) as keys can contain items that were already deleted from the database (an example key layout is sketched after the list below).
This problem is already fixed in other memory caches (they use delete-by-key instead of delete-by-prefix):
- PR #7324
- delete-by-prefix was also implemented in PR #7324, but it is commented out to avoid blocking that PR (in case it is too slow and in case optimization discussions take longer than expected)
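For concreteness, here is a sketch of the kind of concatenated key involved (assuming hex-encoded 32-byte IDs, which matches the 64-char `key[:64]` prefix used in the benchmarks below; the actual key layouts in flow-go may differ):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

func main() {
	// Hypothetical concatenated cache key: hex-encoded 32-byte block ID
	// (64 chars) followed by a hex-encoded 32-byte transaction ID.
	blockID := make([]byte, 32)
	txID := make([]byte, 32)
	rand.Read(blockID) // errors ignored for brevity
	rand.Read(txID)

	key := hex.EncodeToString(blockID) + hex.EncodeToString(txID)

	// Delete-by-key needs the full 128-char key; delete-by-prefix only
	// needs the 64-char block ID portion, which matches all of its keys.
	prefix := key[:64]
	fmt.Println(len(key), prefix == hex.EncodeToString(blockID)) // 128 true
}
```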
On a Broadwell Xeon 2.2 GHz, deleting 5000 keys from a golang-lru cache with 10000 items takes:
- ~0.88 ms for delete-by-key (using 5000 keys for exact-match deletion)
- ~1.63 ms for delete-by-prefix using 1000 prefixes (5 keys per prefix)
  - = ~0.88 ms (deleting 5K items) + ~0.75 ms (overhead of using prefixes)
NOTE: The 1.63 ms duration includes the unavoidable 0.88 ms with golang-lru (see below). So if we optimize use of golang-lru, then we are optimizing the 0.75 ms portion of 1.63 ms and introducing tradeoffs.
If 1.63 ms is fast enough, then we can just use the RemoveFunc approach (already implemented in PR #7324 but currently commented out). It doesn't sacrifice the speed of existing golang-lru functions to speed up the less frequent delete-by-prefix operation.
If 1.63 ms is too slow, then see below for more details and considerations for optimizing the 0.75 ms portion of the 1.63 ms for delete-by-prefix.
Either way, we should probably run more benchmarks and at least identify reasonable upper bound(s) we should be enforcing on the cache item limit configuration flag(s). Some flags shouldn't be unbounded.
Delete-by-key vs Delete-by-prefix
UPDATE: I ran benchmarks on Sunday, May 11, 2025 (to prep for May 12 meeting, not to optimize) and added this section.
If we already have a set of 5000 keys to delete, then removing 5000 items by key from a golang-lru cache containing 10000 items takes about 0.9 ms on a Broadwell Xeon 2.2 GHz.
Sometimes we only have prefixes instead of specific keys to delete. For example, we may have 1000 prefixes (e.g., block ids) during pruning and each prefix might match an average of about 4-5 keys we want to delete.
Given this, deleting 5000 items by prefix from an unmodified golang-lru cache containing 10000 items will require more than 0.9 ms. Benchmark code:
```go
for range b.N {
	// Setup, then start timer:
	// - Create 10K-item cache.
	// - From 1000 block ids (prefixes), create slice of 5K keys.
	// ...
	b.StartTimer()

	// On Broadwell Xeon 2.2 GHz, this part takes about 0.9 ms, so:
	//   ???.? ms to gather 5K keys from 1000 prefixes
	//    0.9  ms to Remove() 5K keys (remove by exact match)
	//   =========
	//   ???.? ms because gathering 5K keys from 1000 block ids takes time
	for _, id := range removeIDs {
		cache.Remove(id) // remove key by exact match
	}
}
```
Using the unoptimized RemoveFunc feature I added to golang-lru takes about 1.6 ms for delete-by-prefix:
- 10000 items in the cache
- 1000 prefixes (block ids)
- 5000 items deleted (5 items matched per prefix)
- ~1.6 ms duration on Broadwell Xeon 2.2 GHz
- ~0.9 ms of this is unavoidable with golang-lru deleting 5K items from a 10K-item cache
- unoptimized `RemoveFunc` required no modifications to existing golang-lru functions
golang-lru's RemoveFunc() iterates over the cache just once, gets the prefix from each item's key, and looks up each prefix in a map of prefixes to remove. Benchmark code:
```go
for range b.N {
	// Create 10K-item cache.
	// NOTE: We avoid gathering 5K keys from 1000 prefixes (block ids).
	// ...
	b.StartTimer()

	cache.RemoveFunc(func(key string) bool {
		keyPrefix := key[:64]
		return removePrefixes[keyPrefix] // item is removed from cache if true
	})

	// On Broadwell Xeon 2.2 GHz, this takes about:
	//   ~0.00 ms very fast to gather prefixes (we already have block ids).
	//    1.63 ms to remove-by-prefix 5K keys.
	//   =======
	//    1.63 ms
}
```
See fxamacker/golang-lru for implementation details about RemoveFunc().
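For readers who don't want to open that repo, here is a minimal sketch of the single-pass idea behind a RemoveFunc-style API, written against a toy map + container/list LRU (the actual fxamacker/golang-lru implementation and signature differ, e.g., in locking and eviction bookkeeping):

```go
package lru

import "container/list"

// Toy LRU used only to illustrate the single-pass RemoveFunc idea.
type Cache struct {
	items map[string]*list.Element // key -> element in eviction list
	evict *list.List               // front = most recently used
}

type entry struct {
	key   string
	value any
}

// RemoveFunc removes every cached item whose key matches the predicate,
// in a single pass over the cache. The predicate can, for example, slice
// out the key's prefix and look it up in a set of prefixes to remove.
// (Deleting map entries while ranging over the map is legal in Go.)
func (c *Cache) RemoveFunc(match func(key string) bool) int {
	removed := 0
	for key, elem := range c.items {
		if match(key) {
			c.evict.Remove(elem)
			delete(c.items, key)
			removed++
		}
	}
	return removed
}
```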
Next Steps
We can either:
- leave deleted items in the cache for some specific scenarios (see Notes section below), or
- use `RemoveFunc()` if it is fast enough for delete-by-prefix during database pruning, etc., or
- invest time to look into optimizing delete-by-prefix (see Notes section below)
Notes
For memory caches using concatenated IDs as keys, there are multiple approaches to evaluate. They include (in no particular order):
A. Leaving deleted items in the cache and relying on a different layer of code to prevent calls to get items from the cache if some criterion isn't met (e.g., items below the pruned height, as suggested by Leo). However, leaving deleted items in the cache can invite future bugs, especially as different layers of code (e.g., storage vs non-storage layer, etc.) get maintained by different people. Also, leaving garbage in the LRU cache can increase cache misses, cause non-garbage items to be evicted earlier than necessary, and waste memory.
B. Removing items by prefix (block id) from an unmodified golang-lru cache is too slow. Using a minimally modified golang-lru cache to provide RemoveFunc may still be too slow (because it avoids further optimizations that would require changing existing golang-lru functions). Batching prefixes can make using RemoveFunc faster. However, this approach is still too slow if 1.6 ms is too slow for removing 5K items from a 10K-item golang-lru cache using 1K prefixes, especially since the cache item limit is 10K only by default and is configurable by flag.
C. Removing items by key (e.g., block id + transaction id) can be faster if we have the list of keys to delete from the cache (rather than a list of prefixes), but there are tradeoffs. E.g., using the most efficient built-in way to delete records from Pebble doesn't provide the list of deleted keys, so the list of deleted keys needs to be retrieved by prefix using db.Iterator, etc. (see the sketch after this list).
D. Changing the cache key from the concatenation of IDs to the prefix. This approach doesn't require modifying golang-lru. However, it can slow down retrieval, which is more frequent than removal (e.g., if each cache entry then has to hold all items sharing that prefix, each retrieval needs an extra lookup within the entry).
E. Maintaining a prefix index for cached records. This index map can be used to remove keys by exact match, which is faster than removing by prefix. However, this approach adds overhead to nearly all operations that modify the cache: reading, inserting, and deleting records from the database, and also evicting items from the cache. It adds memory overhead linear in the cache item limit, but that should be OK for a 10K-item cache. To be efficient, this may or may not benefit from direct modifications to the golang-lru cache (look into this as part of evaluation).
F. Replacing the golang-lru cache with an existing production-quality cache that is optimized for fast delete-by-prefix. A quick search didn't find one (look into this as part of evaluation).
G. Maintaining an index of removed prefixes. The index can be used to filter out deleted items before retrieving from the cache. This approach adds a small overhead to removal and retrieval. It also adds some memory overhead that can be pruned periodically to avoid unbounded growth. To be efficient, this may or may not benefit from direct modifications to the golang-lru cache (look into this as part of evaluation).
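To make the tradeoff in approach C concrete, here is a hedged sketch (not flow-go code) of gathering the to-be-deleted keys by prefix with a Pebble iterator before issuing a range delete. It assumes a recent cockroachdb/pebble where NewIter returns an error, and the prefixUpperBound helper is invented for this sketch:

```go
package storage

import "github.com/cockroachdb/pebble"

// deletePrefix collects the keys matching a prefix, then range-deletes them
// from Pebble. The collected keys can afterwards be removed from the memory
// cache by exact match (delete-by-key).
func deletePrefix(db *pebble.DB, prefix []byte) ([][]byte, error) {
	upper := prefixUpperBound(prefix) // smallest key > all keys with prefix

	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: prefix,
		UpperBound: upper,
	})
	if err != nil {
		return nil, err
	}
	defer iter.Close()

	var deleted [][]byte
	for iter.First(); iter.Valid(); iter.Next() {
		// iter.Key() is only valid until the next step, so copy it.
		deleted = append(deleted, append([]byte(nil), iter.Key()...))
	}

	// DeleteRange is the efficient built-in bulk delete, but by itself it
	// doesn't report which keys were deleted; that's why we iterated first.
	if err := db.DeleteRange(prefix, upper, pebble.Sync); err != nil {
		return nil, err
	}
	return deleted, nil
}

// prefixUpperBound returns the smallest byte slice greater than every key
// that starts with prefix (simplified: assumes prefix has a byte < 0xff).
func prefixUpperBound(prefix []byte) []byte {
	upper := append([]byte(nil), prefix...)
	for i := len(upper) - 1; i >= 0; i-- {
		if upper[i] < 0xff {
			upper[i]++
			return upper[:i+1]
		}
	}
	return nil // all bytes are 0xff: no finite upper bound
}
```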
@j1010001 @janezpodhostnik @peterargue @zhangchiqing
Good meeting today (sorry it was ~30 minutes longer than expected).
NEXT STEPS: I will use Approach G to create a PoC (see the sketch after this list). Approach G combines some aspects of approaches A & B:
- Like approach A, PoC will not immediately remove items from golang-lru.
- Like approach B, PoC will use a removed-prefix map.
- Calls to retrieve item from golang-lru will check the new map (adds overhead of map lookup).
- When removed-prefix map reaches a size limit, we prune the map and the golang-lru cache.
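A minimal sketch of this PoC design, with invented names, using the generic hashicorp/golang-lru/v2 API for illustration (the real PoC may differ):

```go
package storage

import (
	"sync"

	lru "github.com/hashicorp/golang-lru/v2"
)

// prefixFilteredCache wraps an LRU and logically deletes whole prefixes by
// recording them in removedPrefixes instead of scanning the cache. Keys are
// assumed to start with a 64-char prefix (e.g., a hex-encoded block ID).
type prefixFilteredCache[V any] struct {
	mu              sync.Mutex
	cache           *lru.Cache[string, V]
	removedPrefixes map[string]struct{}
	maxRemoved      int // prune threshold for the removed-prefix map
}

func newPrefixFilteredCache[V any](size, maxRemoved int) (*prefixFilteredCache[V], error) {
	c, err := lru.New[string, V](size)
	if err != nil {
		return nil, err
	}
	return &prefixFilteredCache[V]{
		cache:           c,
		removedPrefixes: make(map[string]struct{}),
		maxRemoved:      maxRemoved,
	}, nil
}

// Get filters out logically deleted items; the extra map lookup is the
// retrieval overhead this design accepts.
func (c *prefixFilteredCache[V]) Get(key string) (V, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, removed := c.removedPrefixes[key[:64]]; removed {
		var zero V
		return zero, false
	}
	return c.cache.Get(key)
}

// RemovePrefix is O(1) until the map hits its size limit, at which point
// both the map and the golang-lru cache are pruned.
func (c *prefixFilteredCache[V]) RemovePrefix(prefix string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.removedPrefixes[prefix] = struct{}{}
	if len(c.removedPrefixes) < c.maxRemoved {
		return
	}
	// Prune: physically drop cached items whose prefix was removed,
	// then reset the map.
	for _, key := range c.cache.Keys() {
		if _, removed := c.removedPrefixes[key[:64]]; removed {
			c.cache.Remove(key)
		}
	}
	c.removedPrefixes = make(map[string]struct{})
}
```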
Still needs a fix for caches using a concatenated key.
Quick update on this, since today's backlog meeting was canceled and we had no standup yesterday.
I resumed work on this yesterday (July 9). I also very briefly started to resume this last week, but Jan asked for help with technical assessments, so I finished those instead (Thursday and Monday).
After I wrap this up, I will work on these next (unless something more urgent comes up, etc.):
- flow-go-internal#6293
- https://github.com/onflow/flow-go/issues/7573
Update: I created a new LRU using a different approach than the one discussed, because the approach discussed on May 21 would require more locks and would add overhead to get, set, etc.
For now, the new LRU is named lru.GroupCache. For reasons I can explain later, I'm going to refer to delete-by-prefix as delete-by-group.
The new design retains the fast speed of get while also keeping delete-by-group performance relatively stable even as the total number of items in the cache grows. The trade-off is some extra memory use (cached keys); see the sketch after the list below.
Basically, we don't need to use the same LRU for everything:
- `lru.Cache` can be used by components that don't need the `delete-by-group` feature
- `lru.GroupCache` is for use cases that need to delete a group of keys (by grouping keys by prefix or any other criteria)
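And a hedged sketch of the delete-by-group idea itself (invented names; the actual lru.GroupCache implementation and API may differ). Keys are indexed per group, so removing a group costs time proportional to that group's size rather than the whole cache; storing keys a second time is the extra memory mentioned above:

```go
package cache

import (
	"sync"

	lru "github.com/hashicorp/golang-lru/v2"
)

// Toy sketch of a group-aware LRU wrapper. A real implementation also has
// to unlink evicted keys from the group index (e.g., via an eviction
// callback), which is omitted here.
type GroupCache[V any] struct {
	mu     sync.Mutex
	cache  *lru.Cache[string, V]          // plain LRU (get/add stay fast)
	groups map[string]map[string]struct{} // group -> set of member keys
}

// Add stores the value and records the key under its group.
func (c *GroupCache[V]) Add(group, key string, v V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.groups[group] == nil {
		c.groups[group] = make(map[string]struct{})
	}
	c.groups[group][key] = struct{}{}
	c.cache.Add(key, v)
}

// RemoveGroup deletes every key in one group. Cost is proportional to the
// group size and stays stable as the total number of cached items grows.
func (c *GroupCache[V]) RemoveGroup(group string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for key := range c.groups[group] {
		c.cache.Remove(key)
	}
	delete(c.groups, group)
}
```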
Timeline:
- July 9: It looked like the approach discussed on May 21 would involve more locks and add overhead to `get`, `set`, etc.
- July 10-11: I created a new LRU with a different design and named it `lru.GroupCache` to avoid confusion with the existing `lru.Cache` in the same package.
- July 14: I didn't spend time on this (I reviewed database PRs/feedback and had back-to-back meetings, with no lunch until 3 pm).
- July 15: Added more benchmarks for `delete-by-group` performance, started tidying up modified code, etc.
I wrapped up late last night (~8:45 PM Central), so today I will double-check benchmark code & results, maybe rename some things while benchmarks run, etc., and will begin opening PRs.