
High Memory Usage / LRU cache size is not being respected

Open zaidoon1 opened this issue 3 months ago • 33 comments

options file: OPTIONS.txt

I've set the LRU cache to 1.5 GB for the "url" CF. However, all of a sudden, the service that runs RocksDB hit the max memory limit I allocated for it, and I can see that the LRU cache for the "url" CF hit that limit:

[Screenshot 2024-04-24 at 11:13:44 AM]

This also caused the service to max out CPU usage (likely because of back pressure).

flamegraph:

[flamegraph: cachedb service-2024-04-23T23_16_26+00_00-127m51]

zaidoon1 avatar Apr 24 '24 18:04 zaidoon1

While in the above instance I at least have something "obvious" to blame, this happened again on a different machine. Except this time it doesn't look like we exceeded the configured LRU cache size by much, and yet RocksDB still used up all the memory:

[Screenshot 2024-04-25 at 1:21:20 AM]

zaidoon1 avatar Apr 25 '24 08:04 zaidoon1

@ajkr any idea what could have happened here in both cases? I guess the easiest one to answer is how/why RocksDB went above the allocated LRU cache size. Unfortunately, I don't have any other LOG files to share because of the issues described here: https://github.com/facebook/rocksdb/issues/12584 (nothing showed up in the WARN-level logs, so I don't know what was happening at the time).

zaidoon1 avatar Apr 29 '24 07:04 zaidoon1

I was thinking of using strict LRU capacity, but it looks like reads (and writes?) will fail if the capacity is hit, which is not what I expected. Why don't we evict from cache instead of failing new reads?

zaidoon1 avatar May 01 '24 15:05 zaidoon1

Here is more data: [Screenshot 2024-05-02 at 10:29:12 AM]

Looks like it happens when we have lots of tombstones. This appears to match what was happening in https://github.com/facebook/rocksdb/issues/2952, although the issue there was due to a compaction bug. I'm wondering if there is another compaction bug at play here.

zaidoon1 avatar May 02 '24 14:05 zaidoon1

What allocator are you using? RocksDB tends to perform poorly with glibc malloc, and better with allocators like jemalloc, which is what we use internally. Reference: https://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html

Why don't we evict from cache instead of failing new reads?

We evict from cache as long as we can find clean, unpinned entries to evict. Block cache only contains dirty entries when WriteBufferManager is configured to charge its memory to block cache, and only contains pinned entries according to the BlockBasedTableOptions.

That said, we try to evict from cache even if you don't set strict LRU capacity. That setting is to let you choose the behavior in cases where there is nothing evictable - fail the operation (strict), or allocate more memory (non-strict).
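
For reference, a minimal sketch of where that setting lives in the C++ API (the third argument to NewLRUCache is strict_capacity_limit; the sizes here are placeholders):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::BlockBasedTableOptions table_options;
// 1.5 GB LRU block cache. With strict_capacity_limit=false (the default),
// inserts that find nothing evictable allocate past capacity; with true,
// they fail and the operation returns an error instead.
table_options.block_cache = rocksdb::NewLRUCache(
    1536ULL * 1024 * 1024 /* capacity */,
    -1 /* num_shard_bits (default) */,
    false /* strict_capacity_limit */);

rocksdb::Options options;
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
```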

ajkr avatar May 02 '24 20:05 ajkr

What allocator are you using?

I'm using jemalloc for the allocator (I've double-checked this).

In the last instance this happened (screenshot above), the block cache was not maxing out beyond what is configured, so I don't think that's the issue. I started seeing this issue when I enabled the part of the system that does the "is there an index that matches prefix x" check, which is a prefix seek that only looks at the first kv returned. From the last graph I posted, it also appears to happen when there are a lot of tombstones, so the seek + tombstones combination is very odd/suspect to me (similar to the problem reported in the RocksDB ticket I linked to). Right now, I'm doing a load test, sending 5K requests with unique prefixes, where the prefixes are guaranteed not to find any matching kv.

zaidoon1 avatar May 02 '24 20:05 zaidoon1

Thanks for the details. Are the 5K requests in parallel? Does your memory budget allow indexes to be pinned in block cache (unpartitioned_pinning = PinningTier::kAll, or, less preferably but still good, unpartitioned_pinning = PinningTier::kFlushedAndSimilar)? The CPU profile shows most of the work is decompressing index blocks, and that work might even be redundant in which case the memory consumed by index data would be amplified.

ajkr avatar May 03 '24 01:05 ajkr

  1. Are the 5K requests in parallel? Yes, they are.
  2. I've enabled cache_index_and_filter_blocks; everything else is whatever the default is (I'll need to check the default for unpartitioned_pinning).
  3. It just happened again, and here is what it looked like:
[Screenshot 2024-05-02 at 9:07:51 PM]

[flamegraph: cachedb service-2024-05-03T01_11_22+00_00-378m6]

and here is a pprof to make it easier to see what RocksDB is doing:

[pprof inuse_space: cachedb-2024-05-03T01_13_09+00_00-378m6]

zaidoon1 avatar May 03 '24 01:05 zaidoon1

Also, here are the DB options I have configured: db_options.txt

zaidoon1 avatar May 03 '24 04:05 zaidoon1

and that work might even be redundant in which case the memory consumed by index data would be amplified.

For more details on this problem, see the stats added in #6681. It looks like you have statistics enabled so you might be able to check those stats to confirm or rule out whether that is the problem.

If it is the problem, unfortunately I don't think we have a good solution yet. That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing. Changing cache_index_and_filter_blocks to false could have a similar effect.

ajkr avatar May 03 '24 06:05 ajkr

OK, this is good to know; I'll definitely investigate this part. I would like to confirm: if we assume that's the problem, then my options are:

  1. set cache_index_and_filter_blocks to false
  2. keep cache_index_and_filter_blocks set to true and also set unpartitioned_pinning = PinningTier::kAll? Or should I set cache_index_and_filter_blocks to false and set unpartitioned_pinning?

I think the main reason I set cache_index_and_filter_blocks to true is to cap/control memory usage (but that's also when I thought I had jemalloc enabled when it wasn't, so my issues at the time could be different).

That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing

Regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try-it-and-find-out kind of thing?

Is it any different/better to use WriteBufferManager to control memory usage vs cache_index_and_filter_blocks?
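
For context, a rough sketch of the WriteBufferManager approach being asked about (C++ API; the sizes are placeholders, and charging the write buffers to the block cache is optional):

```cpp
#include <memory>

#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/write_buffer_manager.h>

auto cache = rocksdb::NewLRUCache(1536ULL * 1024 * 1024);

// Cap total memtable memory at 512 MB and charge it to the block cache,
// so memtables and cached blocks compete for the same budget.
auto write_buffer_manager = std::make_shared<rocksdb::WriteBufferManager>(
    512ULL * 1024 * 1024 /* buffer_size */, cache);

rocksdb::Options options;
options.write_buffer_manager = write_buffer_manager;
```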

zaidoon1 avatar May 03 '24 06:05 zaidoon1

Regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try-it-and-find-out kind of thing?

There is a property: TableProperties::index_size. It's available via DB APIs like GetPropertiesOfAllTables(), or externally to the DB via sst_dump on the DB dir. It isn't exactly the same as the memory cost of holding an index block in memory but I think it should give an upper bound.
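
A sketch of summing that property over the live SST files (C++ API; `db` and `url_cf_handle` are placeholders for the open DB and the "url" column family handle):

```cpp
#include <rocksdb/db.h>

// Sum TableProperties::index_size across all live SST files in the CF to get
// a rough upper bound on the memory needed to pin every index block.
uint64_t total_index_size = 0;
rocksdb::TablePropertiesCollection props;
rocksdb::Status s = db->GetPropertiesOfAllTables(url_cf_handle, &props);
if (s.ok()) {
  for (const auto& entry : props) {
    total_index_size += entry.second->index_size;
  }
}
```

Externally, running sst_dump with --show_properties on the DB dir should print the same per-file table properties.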

ajkr avatar May 04 '24 00:05 ajkr

Cool, I'll check this out. Just to double-check, is unpartitioned_pinning = PinningTier::kAll more preferred than setting cache_index_and_filter_blocks to false?

zaidoon1 avatar May 04 '24 04:05 zaidoon1

Cool, I'll check this out. Just to double-check, is unpartitioned_pinning = PinningTier::kAll more preferred than setting cache_index_and_filter_blocks to false?

It is preferable if you want to use our block cache capacity setting for limiting RocksDB's total memory usage.

  • cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll: index block memory usage counts towards block cache capacity. Pinning prevents potential thrashing.
  • cache_index_and_filter_blocks=false: Index block memory usage counts toward table reader memory, which is not counted towards block cache capacity by default. Potential thrashing is still prevented because they are preloaded and non-evictable in table reader memory.
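
A minimal sketch of the first option in the C++ API (the cache size is a placeholder; the pinning tier lives under BlockBasedTableOptions::metadata_cache_options):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/table.h>

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = rocksdb::NewLRUCache(1536ULL * 1024 * 1024);

// Index/filter blocks are stored in, and counted against, the block cache...
table_options.cache_index_and_filter_blocks = true;

// ...and unpartitioned index/filter blocks are pinned so they cannot be
// evicted and then repeatedly re-read and re-decompressed (thrashing).
table_options.metadata_cache_options.unpartitioned_pinning =
    rocksdb::PinningTier::kAll;
```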

ajkr avatar May 04 '24 19:05 ajkr

Great! Thanks for confirming. Once the C API changes land, I'll experiment with this and report back.

zaidoon1 avatar May 04 '24 19:05 zaidoon1

A few other questions that just came to my mind:

  1. Right now, I'm using a prefix extractor with a prefix bloom (ribbon filter, 10.0 ratio); a rough sketch of this setup is shown after this list. The 5K prefix lookups per second are for kvs that don't exist; even the prefix doesn't exist. I expect the ribbon filter to detect this, so RocksDB just skips doing any work. Given this, would we still have the thrashing issue?
  2. Would https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html help in my case?
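
For reference, a rough sketch of the setup in question 1 using the C++ API (the 8-byte prefix length is a placeholder; the Rust bindings set the equivalent options):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

rocksdb::Options options;
// Prefix extractor; the 8-byte length is a placeholder.
options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

rocksdb::BlockBasedTableOptions table_options;
// Ribbon filter at ~10 bits per key. With a prefix extractor configured,
// prefixes are added to the filter, so a Seek for a non-existent prefix can
// skip an SST file without touching its index or data blocks.
table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(10.0));

options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
```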

zaidoon1 avatar May 05 '24 01:05 zaidoon1

Yes, prefix filter should prevent thrashing for index block lookups. I didn't notice earlier that it's already enabled. Then, it's surprising that BinarySearchIndexReader::NewIterator() is consuming most of the CPU. Do you set ReadOptions::total_order_seek for the iterator? That can defeat the prefix filter optimizations.

ajkr avatar May 06 '24 23:05 ajkr

Do you set ReadOptions::total_order_seek for the iterator?

I don't, unless that is set by default in RocksDB under the hood? In the Rust library, I call prefix_iterator_cf, which just sets prefix_same_as_start and then does the lookup: https://github.com/zaidoon1/rust-rocksdb/blob/567825463480b75f733b73f01c1bd05990aea5b9/src/db.rs#L1438-L1446
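
In C++ terms, what prefix_iterator_cf does corresponds roughly to this sketch (`db`, `url_cf_handle`, and `prefix` are placeholders; total_order_seek is left at its default of false):

```cpp
#include <memory>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

rocksdb::ReadOptions read_opts;
read_opts.prefix_same_as_start = true;  // what prefix_iterator_cf sets
// read_opts.total_order_seek stays false (the default), so the prefix
// filter can be consulted during Seek().

std::unique_ptr<rocksdb::Iterator> it(
    db->NewIterator(read_opts, url_cf_handle));
it->Seek(prefix);          // position at the first key with this prefix
bool found = it->Valid();  // only the first kv is inspected
```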

Maybe I should start by looking at the ribbon filter metrics? Is there a specific metric I should be looking at to see if things are working as they should?

zaidoon1 avatar May 07 '24 02:05 zaidoon1

I found the following:

rocksdb.bloom.filter.useful COUNT
rocksdb.bloom.filter.full.positive COUNT
rocksdb.bloom.filter.full.true.positive COUNT
rocksdb.bloom.filter.prefix.checked COUNT
rocksdb.bloom.filter.prefix.useful COUNT
rocksdb.bloom.filter.prefix.true.positive COUNT

I couldn't find anything specific to the ribbon filter, so my guess is the "bloom" filter tickers are also populated for the ribbon filter. If so, which would be the most useful for me to track for this issue?

Or maybe the seek stats: https://github.com/facebook/rocksdb/blob/36ab251c07f9feaafaecf62de854283e0c580619/include/rocksdb/statistics.h#L457-L481 ? I'm not sure which would help me figure out what I need.

zaidoon1 avatar May 07 '24 02:05 zaidoon1

Looks like the *LEVEL_SEEK* statistics are for iterators. The *FILTERED vs. *DATA can tell you the filtering rate. If you have a lot of other iterations happening, it could be hard to attribute the metric values to prefix checks vs. other things. Though if *LEVEL_SEEK*FILTERED stats are zero, that'd tell us a lot.

If you want to measure an operation's stats in isolation, we have PerfContext instead of Statistics for that: https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context
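
A sketch of measuring one prefix seek in isolation this way (C++; `db`, `url_cf_handle`, and `prefix` are placeholders):

```cpp
#include <memory>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/perf_context.h>
#include <rocksdb/perf_level.h>

// Enable per-operation counters and reset them before the operation.
rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
rocksdb::get_perf_context()->Reset();

rocksdb::ReadOptions read_opts;
read_opts.prefix_same_as_start = true;
std::unique_ptr<rocksdb::Iterator> it(
    db->NewIterator(read_opts, url_cf_handle));
it->Seek(prefix);

// Dump all perf counters for this operation; the bloom_*_hit_count /
// bloom_*_miss_count entries are the filter-related ones to look at.
std::string report = rocksdb::get_perf_context()->ToString();
```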

ajkr avatar May 08 '24 00:05 ajkr

Something I want to make sure of: are the *LEVEL_SEEK* statistics DB-level stats, or are they per CF? As far as iterators go, I only use them when doing the prefix check; other operations use MultiGet.

zaidoon1 avatar May 08 '24 02:05 zaidoon1

It's as wide a scope as the Statistics object, which is at minimum one DB since it's configured in DBOptions::statistics. It could be multiple DBs if you configure multiple DBs to use the same object.

ajkr avatar May 08 '24 02:05 ajkr

Great! Thanks for confirming. I'm going to track:

LAST_LEVEL_SEEK_FILTERED,
LAST_LEVEL_SEEK_DATA,
NON_LAST_LEVEL_SEEK_FILTERED,
NON_LAST_LEVEL_SEEK_DATA,

and will report back what I see
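
For reference, a sketch of reading those tickers and computing the filtering rate (C++; `options.statistics` is assumed to already be set, e.g. via CreateDBStatistics()):

```cpp
#include <rocksdb/statistics.h>

// stats is the Statistics object configured in DBOptions::statistics.
std::shared_ptr<rocksdb::Statistics> stats = options.statistics;

uint64_t filtered =
    stats->getTickerCount(rocksdb::LAST_LEVEL_SEEK_FILTERED) +
    stats->getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_FILTERED);
uint64_t data =
    stats->getTickerCount(rocksdb::LAST_LEVEL_SEEK_DATA) +
    stats->getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_DATA);

// Fraction of per-file seek checks that the (prefix) filter was able to skip.
double filter_rate = (filtered + data) > 0
                         ? static_cast<double>(filtered) / (filtered + data)
                         : 0.0;
```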

zaidoon1 avatar May 08 '24 05:05 zaidoon1