
Support kFullChargeCacheMetadata for BlobDB


This is a feature request.

Can integrated BlobDB respect the block cache size limits whether or not it shares the block cache? I started down this path because I was getting OOM during benchmarks.

When I run benchmarks with leveled compaction and a 180G block cache, a shell script that runs "ps" in a loop shows that RSS for the db_bench process doesn't exceed 183G. With BlobDB and similar benchmarks I get OOM when the block cache size is 180G, so I reduced it to 120G for the tests below and learned that RSS will be ~1.4X the size of the block cache.

A similar run that uses integrated BlobDB with leveled compaction, where BlobDB shares the block cache and the block cache size is 120G, has RSS that is ~1.4X the block cache limit. The ohead value in the math below predicts how much larger RSS will be relative to the block cache size.

AFAIK this is with kFullChargeCacheMetadata, the default value. Possible sources of the difference are the memory for the key, the memory for the LRUHandle, and the gap between the size of each allocation and the jemalloc bin it is served from. The jemalloc stats, including the allocation bin sizes, are here for one of the tests.
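
For reference, here is a minimal sketch of this kind of setup, not the exact benchmark.sh configuration: one LRU cache with the default kFullChargeCacheMetadata policy, shared by the block-based table reader and by integrated BlobDB (MakeBlobDbOptions is just a made-up helper name).

```cpp
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeBlobDbOptions() {
  // One LRU cache shared by data/index/filter blocks and by blob values.
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 120ULL << 30;  // 120G
  // Default policy: each entry's LRUHandle (metadata) is charged to capacity.
  cache_opts.metadata_charge_policy = rocksdb::kFullChargeCacheMetadata;
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::Options options;
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = cache;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Integrated BlobDB, with blob values cached in the same LRU cache.
  options.enable_blob_files = true;
  options.blob_cache = cache;
  return options;
}
```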

For the results below I ran overwrite for BlobDB via benchmark.sh before BlobDB was fixed to not insert into the block cache during compaction. Then I did a similar test for leveled compaction using most of the benchmark steps, because overwrite doesn't fill the block cache for leveled compaction.

legend:
* bytes - size of value, key is 20 bytes
* vsz, rss - largest values for VSZ and RSS from db_bench
* bins - entries from the jemalloc bins: section with the largest curregs values
* ohead - (sizeof(key) + sizeof(handle) + sizeof(jemalloc roundup)) / sizeof(value)
* rss/bc - sizeof(RSS) / sizeof(block cache)
* Nrows - number of KV pairs
* jemalloc has many bins, the ones that matter here are 32, 48, 96, 448, 896, 1792 and 4096

For BlobDB with 120G block cache and sharing block cache with blobs.
Measured during overwrite, which until recently inserted into the block cache during compaction.

bytes   vsz     rss     bins            ohead                   rss/bc  Nrows
400     205.7   173.4   32,96,448       (32+96+48)/400=0.44     1.45    6B
800     203.8   164.4   32,96,896       (32+96+96)/800=0.28     1.37    3B
1600    207.3   165.4   32,96,1792      (32+96+192)/1600=0.20   1.38    1.5B
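
For example, reading the 400-byte row: ohead = (32 + 96 + 48) / 400 = 0.44, and the expectation is that RSS will be roughly (1 + ohead) times the block cache size, i.e. ~1.44X, which lines up with the measured rss/bc of 1.45.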

For Leveled compaction with 180G block cache, block_size=8192, metadata_block_size=4096
Measured during readwhilewriting, prepopulate_block_cache not set
Database size is ~1TB on disk

It is interesting that curregs for the 8192-byte bin was small. Does this mean that 99%+ of the block cache is metadata?

bytes   vsz     rss     bins            ohead                   rss/bc  Nrows
400     212.7   183.0   32,48,4096      ?                       1.02    6B
800     212.3   182.6   32,48,4096                              1.01    3B
1600    212.3   183.2                                           1.02    1.5B
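
For reference, a rough sketch of the table options implied by the header above (block_size=8192, metadata_block_size=4096). The partitioned index/filter settings are an assumption here, since metadata_block_size only applies to partitioned index/filter blocks; this is not the exact benchmark.sh configuration.

```cpp
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions MakeLeveledTableOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_size = 8192;  // target size for data blocks
  // metadata_block_size is the target size for partitioned index/filter
  // blocks, so it only takes effect with a two-level index and partitioned
  // filters (assumed here).
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
  table_options.partition_filters = true;
  table_options.metadata_block_size = 4096;
  return table_options;
}
```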

mdcallag avatar Aug 04 '22 23:08 mdcallag

@mdcallag Thanks for filing this.

I think we're in the clear when it comes to the cache-level metadata (size of LRUHandle + cache key), since that's handled by the LRU cache internally. What we're currently not considering for cached blobs is the overhead of the std::string object used (32 bytes), and the space wasted due to the difference between the size of the blob and the jemalloc bin. These should be easy to fix; in addition, I think we can also eliminate a copy during the insertion of blobs into the cache.
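
To illustrate the direction, a sketch only (CachedBlob and BlobCacheCharge are made-up names, and the comment above refers to a std::string rather than this struct): the charge for a cached blob would cover the owning object plus the usable size of the heap allocation, so the jemalloc bin roundup is counted against the cache capacity as well.

```cpp
#include <malloc.h>  // malloc_usable_size (glibc/jemalloc)

#include <cstddef>

// Hypothetical holder for a blob value that was copied into the cache.
struct CachedBlob {
  char* data;   // heap allocation holding the blob bytes
  size_t size;  // logical blob size
};

// Charge the cache for the holder object plus the usable size of the
// allocation (the bin jemalloc actually served it from), not just size.
size_t BlobCacheCharge(const CachedBlob& blob) {
  return sizeof(CachedBlob) + malloc_usable_size(blob.data);
}
```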

ltamasi avatar Aug 05 '22 01:08 ltamasi

Landed a patch that will hopefully fix this in 23376aa5766ff486df666896809ccac38227fbd0. With this change, the charge for blobs inserted into the cache will also take into consideration the size of the handle (metadata) object and the size of the jemalloc bin.

ltamasi avatar Aug 26 '22 23:08 ltamasi

Thanks, appears to be fixed

mdcallag avatar Oct 31 '22 16:10 mdcallag