BlobDB Caching
I want to use this GitHub issue to track each task for BlobDB Caching, since we plan to split the work into multiple PRs to make code review more straightforward and explicit.
Integrate caching into the blob read logic
In contrast with block-based tables, which can utilize RocksDB's block cache (see https://github.com/facebook/rocksdb/wiki/Block-Cache), there is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
- [x] #10155
- [x] #10178
- [x] #10225
- [x] #10198
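As a rough illustration of the idea (not RocksDB's actual implementation), a blob cache can key cached values by the pair (blob file number, offset within the file). The sketch below, with hypothetical `ToyBlobCache` names, shows a minimal in-memory version of such a lookup structure:

```cpp
#include <cstdint>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical illustration: a blob is identified by the blob file it
// lives in and its offset within that file, so a (file number, offset)
// pair makes a natural cache key.
struct BlobCacheKey {
  uint64_t file_number;
  uint64_t offset;
  bool operator==(const BlobCacheKey& o) const {
    return file_number == o.file_number && offset == o.offset;
  }
};

struct BlobCacheKeyHash {
  std::size_t operator()(const BlobCacheKey& k) const {
    return std::hash<uint64_t>()(k.file_number) ^
           (std::hash<uint64_t>()(k.offset) << 1);
  }
};

class ToyBlobCache {
 public:
  // Returns the cached blob value, or nullptr on a miss.
  std::shared_ptr<std::string> Lookup(const BlobCacheKey& key) const {
    auto it = map_.find(key);
    return it == map_.end() ? nullptr : it->second;
  }
  void Insert(const BlobCacheKey& key, std::string value) {
    map_[key] = std::make_shared<std::string>(std::move(value));
  }

 private:
  std::unordered_map<BlobCacheKey, std::shared_ptr<std::string>,
                     BlobCacheKeyHash>
      map_;
};
```

In the real feature, the cache is consulted on the blob read path before issuing a file read, and is filled on a miss; eviction, charging, and priorities are handled by RocksDB's `Cache` abstraction rather than a plain map.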
Clean up Version::MultiGetBlob()
Move blob-related code snippets from `Version::MultiGetBlob()` into `BlobSource::MultiGetBlob()`, and add a new API in `BlobSource`. More context: https://github.com/facebook/rocksdb/pull/10225.
- [x] PR #10272 @riversand963 @ltamasi
```
Version::MultiGetBlob(...)                  // multiple files, multiple blobs
-> BlobSource::MultiGetBlob()               // multiple files, multiple blobs
-> BlobSource::MultiGetBlobFromOneFile()    // one file, multiple blobs
```
By definition, BlobSource also has information about multiple blob files, thus we can push the logic into this layer.
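To illustrate the layering, here is a minimal sketch (hypothetical names, not the actual RocksDB code) of grouping a multi-file blob read batch by file number, so that each blob file only has to be visited once:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical request type: each read names a blob file and an offset.
struct BlobReadRequest {
  uint64_t file_number;
  uint64_t offset;
};

// Group a multi-file batch by file number, mirroring the split between
// the multi-file MultiGetBlob() and the per-file
// MultiGetBlobFromOneFile() described above.
std::map<uint64_t, std::vector<uint64_t>> GroupByFile(
    const std::vector<BlobReadRequest>& reqs) {
  std::map<uint64_t, std::vector<uint64_t>> grouped;
  for (const auto& r : reqs) {
    grouped[r.file_number].push_back(r.offset);
  }
  return grouped;
}
```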
Add the blob cache to the stress tests and the benchmarking tool
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
- [x] #10202
Add blob cache tickers, perf context statistics, and DB properties
In order to be able to monitor the performance of the new blob cache, we made the following changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.
- [x] #10203
Charge blob cache usage against the global memory limit
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
- [x] #10321
- [x] #10206
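The dummy-entry ("reservation") scheme can be modeled as follows. This is a toy sketch with illustrative names and sizes, not RocksDB's actual implementation: the reserved amount tracks the blob cache's usage in fixed-size chunks, and in the real system each chunk corresponds to a dummy entry inserted into the block cache.

```cpp
#include <cstddef>

// Toy model of charging one cache's memory usage against another via
// dummy "reservation" entries, maintained in fixed-size chunks.
class ToyReservationManager {
 public:
  explicit ToyReservationManager(std::size_t chunk_size)
      : chunk_size_(chunk_size) {}

  // Update the reservation to cover `used` bytes, rounded up to whole
  // chunks; in the real scheme this inserts/erases dummy block cache
  // entries so the global limit sees the blob cache's usage.
  void UpdateUsage(std::size_t used) {
    reserved_ = ((used + chunk_size_ - 1) / chunk_size_) * chunk_size_;
  }

  std::size_t reserved() const { return reserved_; }

 private:
  std::size_t chunk_size_;
  std::size_t reserved_ = 0;
};
```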
Eliminate the copying of blobs when serving reads from the cache
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.) This has the potential to save a lot of CPU, especially with large blob values.
- [x] #10297
Support prepopulating/warming the blob cache
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
- [x] #10298
Add a blob-specific cache priority
RocksDB's Cache abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
- [x] #10309
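The intended eviction order can be stated very simply; the sketch below (illustrative names, not RocksDB's `Cache::Priority`) just encodes that a hypothetical "bottom" level is evicted before "low", which is evicted before "high":

```cpp
#include <cstdint>

// Illustrative three-level priority: items at a lower level are evicted
// first, so blobs (bottom) go before data blocks (low), which go before
// index/filter blocks (high).
enum class ToyCachePriority : uint8_t { kBottom = 0, kLow = 1, kHigh = 2 };

// Returns true if an item at priority `a` should be evicted before an
// item at priority `b`.
bool EvictBefore(ToyCachePriority a, ToyCachePriority b) {
  return static_cast<uint8_t>(a) < static_cast<uint8_t>(b);
}
```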
Support using secondary cache with the blob cache
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used.
- [x] #10349
Support an improved/global limit on BlobDB's space amp
BlobDB currently supports limiting space amplification via the configuration option `blob_garbage_collection_force_threshold`. It works by computing the ratio of garbage (i.e. garbage bytes divided by total bytes) over the oldest batch of blob files, and if the ratio exceeds the specified threshold, it triggers a special type of compaction targeting the SST files that point to the blob files in question. (There is a coarse mapping between SSTs and blob files, which we track in the MANIFEST.)
This existing option can be difficult to use or tune. There are (at least) two challenges:
1. The occupancy of blob files is not uniform: older blob files tend to have more garbage, so if a service owner has a specific space amp goal, it is far from obvious what value they should set for `blob_garbage_collection_force_threshold`.
2. BlobDB keeps track of the exact amount of garbage in blob files, which enables us to compute the blob files' "space amp" precisely. Even though it's an exact value, there is a disconnect between this metric and people's expectations regarding space amp. The problem is that while people tend to think of LSM tree space amp as the ratio between the total size of the DB and the total size of the live/current KVs, for the purposes of blob space amp, a blob is only considered garbage once the corresponding blob reference has already been compacted out of the LSM tree. (One could say the LSM tree space amp notion described above is "logical", while the blob one is "physical".)
To make the users' lives easier and solve (1), we would want to add a new configuration option (working title: `blob_garbage_collection_space_amp_limit`) that would enable customers to directly set a space amp target (as opposed to a per-blob-file-batch garbage threshold). To bridge the gap between the above notion of LSM tree space amp and the blob space amp (2), we would want this limit to apply to the entire data structure/database (the LSM tree plus the blob files). Note that this will necessarily be an estimate, since we don't know exactly how much space the obsolete KVs take up in the LSM tree. One simple idea would be to take the reciprocal of the LSM tree space amp estimated using the method of `VersionStorageInfo::EstimateLiveDataSize`, and scale the number of live blob bytes using the same factor.
Example: let's say the LSM tree space amp is 1.5, which means that the live KVs take up two thirds of the LSM. Then, we can use the same 2/3 factor to multiply the value of (total blob bytes - garbage blob bytes) to get an estimate of the live blob bytes from the user's perspective.
Note: if the above limit is breached, we would still want to do the same thing as in the case of `blob_garbage_collection_force_threshold`, i.e. force-compact the SSTs pointing to the oldest blob files (potentially repeatedly, until the limit is satisfied).
- [ ] #10399
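The estimate from the example above can be sketched as a small piece of arithmetic (illustrative function name, not a RocksDB API): scale the physically live blob bytes by the reciprocal of the LSM tree's space amp, then compute an overall space amp over the LSM tree plus the blob files.

```cpp
// Sketch of the proposed estimate: with an LSM space amp of 1.5, live
// KVs take up 2/3 of the tree, and the same 2/3 factor is applied to
// (total blob bytes - garbage blob bytes) to approximate the logically
// live blob bytes.
double EstimateOverallSpaceAmp(double lsm_total_bytes,
                               double lsm_space_amp,  // e.g. 1.5
                               double blob_total_bytes,
                               double blob_garbage_bytes) {
  // Live fraction of the LSM tree, e.g. 1 / 1.5 = 2/3.
  const double live_fraction = 1.0 / lsm_space_amp;
  const double lsm_live_bytes = lsm_total_bytes * live_fraction;
  // Scale the physically live blob bytes by the same factor.
  const double blob_live_bytes =
      (blob_total_bytes - blob_garbage_bytes) * live_fraction;
  const double total_bytes = lsm_total_bytes + blob_total_bytes;
  const double live_bytes = lsm_live_bytes + blob_live_bytes;
  return total_bytes / live_bytes;
}
```

For instance, a 150-byte LSM tree with space amp 1.5 has 100 live bytes; 90 total blob bytes with 30 bytes of garbage yield an estimated 40 logically live blob bytes, giving an overall estimate of 240/140.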
Potential Bug
- [ ] Double-check the lifetime of `db_impl::db_id_`. See https://github.com/facebook/rocksdb/pull/10198
- [x] BlobDB in crash test hitting assertion #10248
Is it planned to support the blob cache option in rocksdbjni?
@cavallium Currently we have an MVP; we will support it in rocksdbjni soon.
Thanks so much for implementing this feature @gangliao !
Thank you for your mentorship. :)))