BlobDB Caching
I want to use this GitHub issue to track each task for BlobDB Caching, since we plan to split the work into multiple PRs to make code review more straightforward and explicit.
Integrate caching into the blob read logic
In contrast with block-based tables, which can utilize RocksDB's block cache (see https://github.com/facebook/rocksdb/wiki/Block-Cache), there is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
- [x] #10155
- [x] #10178
- [x] #10225
- [x] #10198
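As a rough illustration of the idea (not RocksDB's actual implementation), a blob cache can key cached values by the pair (blob file number, offset within the file). The sketch below, with hypothetical `ToyBlobCache` names, shows a minimal in-memory version of such a lookup structure:

```cpp
#include <cstdint>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical illustration: a blob is identified by the blob file it
// lives in and its offset within that file, so a (file number, offset)
// pair makes a natural cache key.
struct BlobCacheKey {
  uint64_t file_number;
  uint64_t offset;
  bool operator==(const BlobCacheKey& o) const {
    return file_number == o.file_number && offset == o.offset;
  }
};

struct BlobCacheKeyHash {
  std::size_t operator()(const BlobCacheKey& k) const {
    return std::hash<uint64_t>()(k.file_number) ^
           (std::hash<uint64_t>()(k.offset) << 1);
  }
};

class ToyBlobCache {
 public:
  // Returns the cached blob value, or nullptr on a miss.
  std::shared_ptr<std::string> Lookup(const BlobCacheKey& key) const {
    auto it = map_.find(key);
    return it == map_.end() ? nullptr : it->second;
  }
  void Insert(const BlobCacheKey& key, std::string value) {
    map_[key] = std::make_shared<std::string>(std::move(value));
  }

 private:
  std::unordered_map<BlobCacheKey, std::shared_ptr<std::string>,
                     BlobCacheKeyHash>
      map_;
};
```

In the real feature, the cache is consulted on the blob read path before issuing a file read, and is filled on a miss; eviction, charging, and priorities are handled by RocksDB's `Cache` abstraction rather than a plain map.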
Clean up Version::MultiGetBlob()
Move blob-related code snippets from `Version::MultiGetBlob()` into `BlobSource::MultiGetBlob()`, and add a new API in `BlobSource`. More context: https://github.com/facebook/rocksdb/pull/10225.
- [x] PR #10272 @riversand963 @ltamasi
```
Version::MultiGetBlob(...)                  // multiple files, multiple blobs
-> BlobSource::MultiGetBlob()               // multiple files, multiple blobs
-> BlobSource::MultiGetBlobFromOneFile()    // one file, multiple blobs
```
By definition, BlobSource also has information about multiple blob files, thus we can push the logic into this layer.
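To illustrate the layering, here is a minimal sketch (hypothetical names, not the actual RocksDB code) of grouping a multi-file blob read batch by file number, so that each blob file only has to be visited once:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical request type: each read names a blob file and an offset.
struct BlobReadRequest {
  uint64_t file_number;
  uint64_t offset;
};

// Group a multi-file batch by file number, mirroring the split between
// the multi-file MultiGetBlob() and the per-file
// MultiGetBlobFromOneFile() described above.
std::map<uint64_t, std::vector<uint64_t>> GroupByFile(
    const std::vector<BlobReadRequest>& reqs) {
  std::map<uint64_t, std::vector<uint64_t>> grouped;
  for (const auto& r : reqs) {
    grouped[r.file_number].push_back(r.offset);
  }
  return grouped;
}
```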
Add the blob cache to the stress tests and the benchmarking tool
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
- [x] #10202
Add blob cache tickers, perf context statistics, and DB properties
In order to be able to monitor the performance of the new blob cache, we made the following changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.
- [x] #10203
Charge blob cache usage against the global memory limit
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
- [x] #10321
- [x] #10206
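The dummy-entry ("reservation") scheme can be modeled as follows. This is a toy sketch with illustrative names and sizes, not RocksDB's actual implementation: the reserved amount tracks the blob cache's usage in fixed-size chunks, and in the real system each chunk corresponds to a dummy entry inserted into the block cache.

```cpp
#include <cstddef>

// Toy model of charging one cache's memory usage against another via
// dummy "reservation" entries, maintained in fixed-size chunks.
class ToyReservationManager {
 public:
  explicit ToyReservationManager(std::size_t chunk_size)
      : chunk_size_(chunk_size) {}

  // Update the reservation to cover `used` bytes, rounded up to whole
  // chunks; in the real scheme this inserts/erases dummy block cache
  // entries so the global limit sees the blob cache's usage.
  void UpdateUsage(std::size_t used) {
    reserved_ = ((used + chunk_size_ - 1) / chunk_size_) * chunk_size_;
  }

  std::size_t reserved() const { return reserved_; }

 private:
  std::size_t chunk_size_;
  std::size_t reserved_ = 0;
};
```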
Eliminate the copying of blobs when serving reads from the cache
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.) This has the potential to save a lot of CPU, especially with large blob values.
- [x] #10297
Support prepopulating/warming the blob cache
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
- [x] #10298
Add a blob-specific cache priority
RocksDB's Cache abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
- [x] #10309
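The intended eviction order can be stated very simply; the sketch below (illustrative names, not RocksDB's `Cache::Priority`) just encodes that a hypothetical "bottom" level is evicted before "low", which is evicted before "high":

```cpp
#include <cstdint>

// Illustrative three-level priority: items at a lower level are evicted
// first, so blobs (bottom) go before data blocks (low), which go before
// index/filter blocks (high).
enum class ToyCachePriority : uint8_t { kBottom = 0, kLow = 1, kHigh = 2 };

// Returns true if an item at priority `a` should be evicted before an
// item at priority `b`.
bool EvictBefore(ToyCachePriority a, ToyCachePriority b) {
  return static_cast<uint8_t>(a) < static_cast<uint8_t>(b);
}
```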
Support using secondary cache with the blob cache
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used.
- [x] #10349
Support an improved/global limit on BlobDB's space amp
BlobDB currently supports limiting space amplification via the configuration option `blob_garbage_collection_force_threshold`. It works by computing the ratio of garbage (i.e. garbage bytes divided by total bytes) over the oldest batch of blob files, and if the ratio exceeds the specified threshold, it triggers a special type of compaction targeting the SST files that point to the blob files in question. (There is a coarse mapping between SSTs and blob files, which we track in the MANIFEST.)
This existing option can be difficult to use or tune. There are (at least) two challenges:
1. The occupancy of blob files is not uniform: older blob files tend to have more garbage, so if a service owner has a specific space amp goal, it is far from obvious what value they should set for `blob_garbage_collection_force_threshold`.
2. BlobDB keeps track of the exact amount of garbage in blob files, which enables us to compute the blob files' "space amp" precisely. Even though it's an exact value, there is a disconnect between this metric and people's expectations regarding space amp. The problem is that while people tend to think of LSM tree space amp as the ratio between the total size of the DB and the total size of the live/current KVs, for the purposes of blob space amp, a blob is only considered garbage once the corresponding blob reference has already been compacted out of the LSM tree. (One could say the LSM tree space amp notion described above is "logical", while the blob one is "physical".)
To make the users' lives easier and solve (1), we would want to add a new configuration option (working title: `blob_garbage_collection_space_amp_limit`) that would enable customers to directly set a space amp target (as opposed to a per-blob-file-batch garbage threshold). To bridge the gap between the above notion of LSM tree space amp and the blob space amp (2), we would want this limit to apply to the entire data structure/database (the LSM tree plus the blob files). Note that this will necessarily be an estimate, since we don't know exactly how much space the obsolete KVs take up in the LSM tree. One simple idea would be to take the reciprocal of the LSM tree space amp estimated using the method of `VersionStorageInfo::EstimateLiveDataSize`, and scale the number of live blob bytes using the same factor.
Example: let's say the LSM tree space amp is 1.5, which means that the live KVs take up two thirds of the LSM. Then, we can use the same 2/3 factor to multiply the value of (total blob bytes - garbage blob bytes) to get an estimate of the live blob bytes from the user's perspective.
Note: if the above limit is breached, we would still want to do the same thing as in the case of `blob_garbage_collection_force_threshold`, i.e. force-compact the SSTs pointing to the oldest blob files (potentially repeatedly, until the limit is satisfied).
- [ ] #10399
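The estimate from the example above can be sketched as a small piece of arithmetic (illustrative function name, not a RocksDB API): scale the physically live blob bytes by the reciprocal of the LSM tree's space amp, then compute an overall space amp over the LSM tree plus the blob files.

```cpp
// Sketch of the proposed estimate: with an LSM space amp of 1.5, live
// KVs take up 2/3 of the tree, and the same 2/3 factor is applied to
// (total blob bytes - garbage blob bytes) to approximate the logically
// live blob bytes.
double EstimateOverallSpaceAmp(double lsm_total_bytes,
                               double lsm_space_amp,  // e.g. 1.5
                               double blob_total_bytes,
                               double blob_garbage_bytes) {
  // Live fraction of the LSM tree, e.g. 1 / 1.5 = 2/3.
  const double live_fraction = 1.0 / lsm_space_amp;
  const double lsm_live_bytes = lsm_total_bytes * live_fraction;
  // Scale the physically live blob bytes by the same factor.
  const double blob_live_bytes =
      (blob_total_bytes - blob_garbage_bytes) * live_fraction;
  const double total_bytes = lsm_total_bytes + blob_total_bytes;
  const double live_bytes = lsm_live_bytes + blob_live_bytes;
  return total_bytes / live_bytes;
}
```

For instance, a 150-byte LSM tree with space amp 1.5 has 100 live bytes; 90 total blob bytes with 30 bytes of garbage yield an estimated 40 logically live blob bytes, giving an overall estimate of 240/140.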
Potential Bug
- [ ] Double-check the lifetime of `db_impl::db_id_`. See https://github.com/facebook/rocksdb/pull/10198
- [x] BlobDB in crash test hitting assertion #10248
Is it planned to support the blob cache option in rocksdbjni?
@cavallium Currently we have an MVP; we will support it in rocksdbjni soon.
Thanks so much for implementing this feature @gangliao !
Thank you for your mentorship. :)))