Add a blob-specific cache priority
Summary:
RocksDB's Cache abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
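To make the proposal concrete, here is a minimal sketch of the idea at the API level (simplified shapes and names, not the exact RocksDB signatures): the existing two-level priority enum gains a third, bottom level, and blob insertions are tagged with it.

```cpp
// Minimal sketch only: the real Cache::Insert also takes a deleter/helper,
// a handle out-parameter, etc. The point is the extra priority level and
// the per-entry-type choice of priority.
#include <cstddef>

namespace sketch {

// Existing HIGH/LOW levels plus the proposed bottom level for blobs.
enum class Priority { HIGH, LOW, BOTTOM };

class Cache {
 public:
  virtual ~Cache() = default;
  virtual bool Insert(const char* key, void* value, size_t charge,
                      Priority priority) = 0;
};

// Intended mapping:
//   index/filter (meta) blocks -> Priority::HIGH
//   data blocks                -> Priority::LOW
//   blob values                -> Priority::BOTTOM
inline bool InsertBlob(Cache& cache, const char* key, void* blob,
                       size_t charge) {
  return cache.Insert(key, blob, charge, Priority::BOTTOM);
}

}  // namespace sketch
```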
This task is a part of https://github.com/facebook/rocksdb/issues/10156
The previous PR (#10309) was reverted by #10434 because it caused regression tests to fail (degraded performance). We have since reproduced the regression and fixed the underlying bug by comparing the performance of the three versions (#10451); this PR contains the latest implementation.
Performance test results for the three binaries were provided by @mdcallag; more details can be found in [bug, revert, fix].
@pdillinger @ltamasi @gitbw95 @riversand963
I don't think we should introduce small passive performance regressions (affecting performance when new feature is not enabled) for features that haven't been proven necessary.
One possibility that I'm not in love with but might be a decent compromise would be to add two bool template parameters to LRUCache for whether certain priority pools exist. This way the appropriately optimized implementation can be used when those pools don't (need to) exist. This code is hot enough that I think we can accept code size increase.
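Roughly what I have in mind, as a purely hypothetical sketch (none of this is code from the PR): make pool existence a compile-time property, so the instantiation used when the pools are disabled carries none of the extra bookkeeping.

```cpp
// Hypothetical illustration of the template-parameter idea: the pool
// accounting compiles away entirely in the instantiation that has no pools,
// keeping the hot path (e.g. LRU_Remove) as lean as it is today.
#include <cstddef>

template <bool kHasHighPriPool, bool kHasLowPriPool>
class LRUShardSketch {
 public:
  void Remove(size_t charge, bool in_high_pri_pool, bool in_low_pri_pool) {
    total_charge_ -= charge;
    if constexpr (kHasHighPriPool) {
      if (in_high_pri_pool) high_pri_pool_charge_ -= charge;
    }
    if constexpr (kHasLowPriPool) {
      if (in_low_pri_pool) low_pri_pool_charge_ -= charge;
    }
  }

 private:
  size_t total_charge_ = 0;
  size_t high_pri_pool_charge_ = 0;
  size_t low_pri_pool_charge_ = 0;
};

// The cache constructor would then pick the right instantiation once, e.g.
// LRUShardSketch<false, false> when neither pool is configured.
```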
I'd say let's first establish whether there is actually a performance impact for the common case with the updated code. This is still in progress. (The earlier patch did have a bug that we have since fixed.) Personally, the change doesn't strike me as something that would introduce a significant overhead; then again, I might turn out to be wrong on this one, in which case we can obviously reconsider.
This is among the hottest of hot code, and this change introduces strictly more tracking to be done regardless of enabling the feature. Just look at LRU_Remove code, which alone is 1% of total CPU (nearly 2% of RocksDB CPU) of a large workload looking for CPU improvements and 0.3% of a very large workload. (All of LRUCache is roughly 9% and 2.5% of those workloads.)
I know it's hot - what I'm saying is that I don't expect the cost to be noticeable if the feature is not enabled (low_pri_pool_ratio==0, which is the default), especially when FDO is used.
I will share more benchmark results later today.
The low pri pool was added as "mid-point" insertion. The purpose is to make sure occasional full scans (logical backups, shard movements, administrative scans, etc) that touch a large range of otherwise cold blocks would not thrash all the hot blocks, and they will die in low pri pool and blocks in high pri pool will remain in cache. With bottom pri pool, does it mean those scans will thrash both low pri pool and bottom pool which contains blobs?
Our goal is to prioritize SST data blocks over blobs, since with BlobDB data blocks sort of function as an index. So when those scans happen, we would prefer to keep data blocks in the cache at the expense of blobs.
My test shows an unexplained small improvement (~2% ops/sec on a benchmark with 50% CPU in LRUCache) with the feature present but disabled, and ~9% overhead with the feature enabled. Assuming Mark's tests come out OK, that satisfies me on the performance front.
The answer to my question is yes, or no?
Sorry I'm not sure I understand your question. If you have all three pools enabled (i.e. high ratio > 0, low ratio > 0), metablocks will go in the high-pri pool, data blocks will go in the low-pri pool, and blobs will go in the bottom-pri pool. So if we have a full scan, blobs will see more thrashing than data blocks, which in turn will see more thrashing than metablocks (which is I think what we want).
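For a concrete sense of the split (illustrative numbers only): with a 1 GB cache, high_pri_pool_ratio=0.1 and low_pri_pool_ratio=0.4 would reserve roughly 100 MB for the high-pri pool (metablocks) and 400 MB for the low-pri pool (data blocks), leaving the remaining ~500 MB as the bottom-pri region where blobs live and where scan-driven churn hits first.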
I have results from 3 benchmarks. I will summarize them here, then add 2 more comments with more detail per benchmark.
Two binaries were compared; I assume these are close in the git log:
- fix - db_bench with the proposed fix, git hash e69183be42, this is the PR
- revert - db_bench with the original change reverted, git hash d976f6897, this is RocksDB without the PR
The test server has 256G RAM, 40 cores, 80 HW threads and hyperthreads are enabled.
The benchmarks are:
- viewstate - from tools/regression_test.sh, 335G database, uses buffered IO for user reads, 16 client threads
- my benchmark.sh, by RX - database is cached by RocksDB
- my benchmark.sh, by OS - database is cached by the OS, 1G RocksDB block cache
- all use leveled compaction, BlobDB was not evaluated here
For my benchmark.sh
- I mean using the wrapper scripts I normally use. The database has 40M KV pairs, 20-byte keys, and 400-byte values. Compression was not used. I assume, but did not confirm, that compression is used for viewstate
- tests were repeated for 4 and 16 client threads
Everything was run 3 times and I focus on the median values for QPS and CPU/query
I expect by RX at 4 threads to be the most likely to show a regression. The summary is ...
For by OS:
- qps for fix is 0.57% smaller at 4 threads, 0.26% smaller at 16 threads
- cpuM/q (cpu/query) for fix is 0.42% larger at 4 threads, 0.18% larger at 16 threads
For by RX:
- qps for fix is ~3.5% larger at 4 threads, ~2.0% larger at 16 threads
- cpuM/q (cpu/query) for fix is ~3.4% smaller at 4 threads, ~1.9% smaller at 16 threads
For viewstate:
- qps for fix is ~0.29% smaller
- cpuM/q (cpu/query) for fix is ~0.21% larger
More detail for by RX and by OS
Using tools/benchmark.sh with 40M KV pairs, 20-byte keys, 400-byte values and no compression. This was run for 4 and 16 client threads. Finally, each benchmark was repeated 3 times and I focus on the median for QPS and CPU overhead.
There were 2 configs:
- by OS - 1G RocksDB block cache, all data in OS page cache
- by RX - data fits in RocksDB block cache
For "by OS":
- qps for fix is 0.57% smaller at 4 threads, 0.26% smaller at 16 threads
- cpuM/q (cpu/query) for fix is 0.42% larger at 4 threads, 0.18% larger at 16 threads
For "by RX":
- qps for fix is ~3.5% larger at 4 threads, ~2.0% larger at 16 threads
- cpuM/q (cpu/query) for fix is ~3.4% smaller at 4 threads, ~1.9% smaller at 16 threads
Legend:
* qps - average throughput
* user, sys - user CPU and system CPU seconds per /bin/time; wall-clock time was ~600 seconds
* cpuM/q - (cpu/queries)*1M where cpu is user + system time from db_bench per /bin/time and queries is total number of queries
* cached? - "by RX" if cached by RocksDB, "by OS" otherwise
* threads - number of client threads
* binary - "revert" if db_bench with change reverted, "fix" if with change fixed
--- by OS results
- 4 threads
qps user sys cpuM/q threads binary
283450 1609 788 14.096 4 revert
283616 1608 789 14.088 4 fix
-
284900 1601 796 14.023 4 revert
282878 1609 788 14.125 4 fix
-
284490 1605 792 14.044 4 revert
281000 1614 783 14.218 4 fix
-
* median QPS is 284490 for revert, 282878 for fix
* relative QPS (median fix / median revert) = 0.9943 -> QPS for fix is 0.57% smaller
* median cpuM/q is 14.044 for revert, 14.125 for fix
* relative cpuM/q (median fix / median revert) = 1.0058 -> cpuM/q for fix is 0.42% larger
- 16 threads
qps user sys cpuM/q threads binary
1052129 6318 3212 15.095 16 revert
1048474 6321 3203 15.140 16 fix
-
1051213 6334 3198 15.113 16 revert
1052706 6319 3213 15.090 16 fix
-
1049348 6318 3212 15.135 16 revert
1048458 6346 3189 15.157 16 fix
-
* median QPS is 1051213 for revert, 1048474 for fix
* relative QPS (median fix / median revert) = 0.9974 -> QPS for fix is 0.26% smaller
* median cpuM/q is 15.113 for revert, 15.140 for fix
* relative cpuM/q (median fix / median revert) = 1.0018 -> cpuM/q for fix is 0.18% larger
--- by RX results
- 4 threads
qps user sys cpuM/q threads binary
458115 2331 58 8.691 4 revert
460179 2333 56 8.651 4 fix
-
456688 2326 61 8.711 4 revert
474120 2334 56 8.400 4 fix
-
466457 2333 56 8.538 4 revert
479702 2331 58 8.303 4 fix
-
* median QPS is 458115 for revert, 474120 for fix
* relative QPS (median fix / median revert) = 1.0349 -> QPS for fix is 3.49% larger
* median cpuM/q is 8.691 for revert, 8.400 for fix
* relative cpuM/q (median fix / median revert) = 0.9665 -> cpuM/q for fix is 3.35% smaller
- 16 threads
qps user sys cpuM/q threads binary
1615606 8797 586 9.680 16 revert
1659593 8789 598 9.427 16 fix
-
1602179 8817 577 9.772 16 revert
1645338 8843 561 9.526 16 fix
-
1620401 8805 586 9.659 16 revert
1647635 8804 583 9.495 16 fix
-
* median QPS is 1615606 for revert, 1647635 for fix
* relative QPS (median fix / median revert) = 1.0198 -> QPS for fix is 1.98% larger
* median cpuM/q is 9.680 for revert, 9.495 for fix
* relative cpuM/q (median fix / median revert) = 0.9809 -> cpuM/q for fix is 1.91% smaller
For viewstate
This has results from viewstate for readrandom, seekrandomwhilewriting and readwhilewriting.
I didn't run seekrandom because it tends to have more variance, which I have yet to explain.
Tests are run with 16 client threads on a host with 40 cores, 80 HW threads, HT is enabled and 256G RAM.
For the viewstate test there are 128 databases (--num_multi_db=128) and the total db size is ~335G. O_DIRECT is used for compaction, buffered IO is used for user reads, and I assume the database files are compressed.
Notes:
* qps for fix is ~0.29% smaller
* cpuM/q (cpu/query) for fix is ~0.21% larger
* storage reads/query are similar
Legend:
* qps - average throughput
* p50, p99 - response time in microseconds
* p99.9, p99.99 - response time in seconds
* real, user, sys - seconds of wall-clock, user CPU and system CPU time per /bin/time
* cpuM/q - (cpu/queries)*1M where cpu is user + system time from db_bench per /bin/time and queries is total number of queries
Result order is: revert, fix
readrandom
qps p50 p99 p99.9 p99.99 real user sys cpuM/q binary
115835 138140 376360 0.548 0.721 1801 17838 2868 99.30 revert
115595 138310 376510 0.549 0.721 1801 17823 2866 99.42 fix
-
115683 138270 376470 0.549 0.730 1801 17850 2864 99.47 revert
115485 138480 376570 0.550 0.728 1801 17839 2863 99.58 fix
-
115818 138180 376400 0.549 0.722 1801 17851 2867 99.37 revert
115444 138360 376680 0.550 0.729 1801 17843 2869 99.66 fix
-
* median QPS is 115818 for revert, 115485 for fix
* ratio of median QPS (fix/revert) = 0.9971 -> QPS for fix is ~0.29% less
* median cpuM/q is 99.37 for revert, 99.58 for fix
* ratio of median cpuM/q (fix / revert) is 1.0021 -> fix uses ~0.21% more CPU/query
seekrandomwhilewriting
qps p50 p99 p99.9 p99.99 real user sys binary
65909 230490 571870 0.812 1.028 1803 18357 2997 revert
65879 230640 571920 0.813 1.066 1803 18348 2995 fix
-
65864 230650 572060 0.814 1.063 1804 18352 2999 revert
65851 230680 572020 0.814 1.042 1803 18351 2996 fix
-
65895 230590 571960 0.813 1.064 1803 18349 2996 revert
65743 230960 572080 0.814 1.060 1803 18370 2990 fix
-
* median QPS is 65895 for revert, 65851 for fix
* ratio of median QPS (fix/revert) = 0.9993 -> QPS for fix is ~0.07% less
readwhilewriting
qps p50 p99 p99.9 p99.99 real user sys binary
113692 139610 377860 0.558 0.880 1853 17557 2994 revert
113565 139770 377810 0.558 0.869 1853 17559 3005 fix
-
113589 139700 377940 0.559 0.864 1852 17551 2990 revert
113590 139680 377870 0.558 0.857 1854 17574 2994 fix
-
113773 139600 377850 0.558 0.864 1853 17558 2995 revert
113507 139770 377890 0.558 0.865 1853 17580 2991 fix
-
* median QPS is 113692 for revert, 113565 for fix
* ratio of median QPS (fix/revert) = 0.9989 -> QPS for fix is 0.11% less
The first time a data block is inserted, it goes to low-pri pool, but if it gets hit again, it is promoted to the high pri pool. On the other hand, what is evicted from the high pri pool will be inserted into the low-pri pool. Metablock in high pri pool is just a paranoid optimization, rather than the main purpose of the two-pool approach.
Metablock in high pri pool is just a paranoid optimization, rather than the main purpose of the two-pool approach.
One data point regarding that: technically the root cause of the earlier regression was that when there was a low-pri pool configured but no high-pri one, metablocks ended up at the bottom (below data blocks) and got evicted first. This suggests to me that where we insert them in the LRU list does make a difference.
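For readers less familiar with the mid-point insertion scheme, here is a pseudocode-level sketch of the behavior described above (my reading of it, with made-up names, not the actual LRUCacheShard code): first insertion lands at the midpoint, a subsequent hit promotes the entry to the high-pri region when it is re-inserted, and entries aging out of the high-pri region are demoted rather than evicted outright.

```cpp
// Pseudocode-level sketch of midpoint insertion with promotion on re-hit.
// Names and structure are illustrative; the real LRUCacheShard keeps a
// single doubly-linked list with pool boundary pointers instead.
#include <cstddef>
#include <list>
#include <string>
#include <utility>

struct SketchEntry {
  std::string key;
  bool has_hit = false;  // set on Lookup; consulted on the next re-insert
};

class TwoPoolListSketch {
 public:
  explicit TwoPoolListSketch(size_t high_pri_budget)
      : high_pri_budget_(high_pri_budget) {}

  void InsertOrReinsert(SketchEntry e) {
    if (e.has_hit) {
      // Promoted: the entry has been hit at least once since insertion.
      high_.push_front(std::move(e));
      MaybeOverflowHighPri();
    } else {
      // First use: midpoint insertion at the head of the low-pri region.
      low_.push_front(std::move(e));
    }
  }

 private:
  void MaybeOverflowHighPri() {
    // Entries aging out of the high-pri region are demoted to the low-pri
    // region rather than evicted immediately.
    while (high_.size() > high_pri_budget_) {
      low_.push_front(std::move(high_.back()));
      high_.pop_back();
    }
  }

  size_t high_pri_budget_;
  std::list<SketchEntry> high_;  // hot region (head = most recently used)
  std::list<SketchEntry> low_;   // cold region; eviction happens at the back
};
```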
@siying @mdcallag Please let us know if you have any concerns about landing this considering the performance numbers above. Thanks!
Did a test with BlobDB to quantify the effects of the bottommost cache priority. After populating a smallish (fits in the OS page cache) DB with 1KB-sized values using fillrandom, I compared readrandom performance (with 16 threads) using a 1GB shared block/blob cache with two settings. Baseline was high_pri_pool_ratio=0.1 (enough to hold all the metablocks) and low_pri_pool_ratio=0.0 (default). The other configuration had the new feature enabled: high_pri_pool_ratio=0.1 (same as before) and low_pri_pool_ratio=0.4.
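For reference, a hedged sketch of roughly how such a shared block/blob cache would be wired up (field names to the best of my understanding; low_pri_pool_ratio is the knob added by this PR, and the exact db_bench invocation is not reproduced here):

```cpp
// Sketch of the test configuration described above: a single 1 GB LRU cache
// shared by the block cache and the blob cache, with 10% reserved for the
// high-pri pool and (in the "feature enabled" run) 40% for the low-pri pool.
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options SharedBlockAndBlobCacheOptions() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1ull << 30;      // 1 GB shared cache
  cache_opts.high_pri_pool_ratio = 0.1;  // metablocks
  cache_opts.low_pri_pool_ratio = 0.4;   // data blocks; 0.0 in the baseline
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::Options options;
  options.enable_blob_files = true;  // integrated BlobDB
  options.blob_cache = cache;        // blobs land in the bottom-pri region

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;    // same cache object for SST blocks
  table_opts.cache_index_and_filter_blocks = true;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```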
I repeated the tests three times. Here are the data block cache miss counters for the baseline (note: there were no cache misses for index blocks after the initial load with either configuration):
rocksdb.block.cache.data.miss COUNT : 371737
rocksdb.block.cache.data.miss COUNT : 371568
rocksdb.block.cache.data.miss COUNT : 372248
With the low-priority pool enabled:
rocksdb.block.cache.data.miss COUNT : 99611
rocksdb.block.cache.data.miss COUNT : 99625
rocksdb.block.cache.data.miss COUNT : 99708
So data blocks saw much less churn with the low-priority pool enabled, which is what we would want since we have to go through them to get to the blobs. This also translated to higher ops/sec: baseline was 617698/633135/635381 for the three runs vs 620729/636272/643619 with the feature enabled.
Good results! I assume total read I/O bandwidth also improved. Can you post the numbers here for the record?
Some more stats about overall cache misses as discussed with @siying :
Baseline:
rocksdb.block.cache.miss COUNT : 371740
rocksdb.blobdb.cache.miss COUNT : 111060940
rocksdb.block.cache.miss COUNT : 371571
rocksdb.blobdb.cache.miss COUNT : 111062668
rocksdb.block.cache.miss COUNT : 372251
rocksdb.blobdb.cache.miss COUNT : 111062404
With the feature enabled:
rocksdb.block.cache.miss COUNT : 99614
rocksdb.blobdb.cache.miss COUNT : 111078933
rocksdb.block.cache.miss COUNT : 99628
rocksdb.blobdb.cache.miss COUNT : 111070234
rocksdb.block.cache.miss COUNT : 99711
rocksdb.blobdb.cache.miss COUNT : 111069999
So there are fewer misses overall and also less I/O: a ~270,000 reduction in SST block misses in exchange for a ~10,000-20,000 increase in blob misses (and the 1-KB blobs are smaller than SST blocks, so each blob miss costs less I/O).
@ltamasi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@gangliao has updated the pull request. You must reimport the pull request before landing.
Hi @gangliao!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!