Add a blob-specific cache priority
Summary:
RocksDB's Cache abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
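To make the proposal concrete, here is a minimal sketch of the idea at the API level (simplified shapes and names, not the exact RocksDB signatures): the existing two-level priority enum gains a third, bottom level, and blob insertions are tagged with it.

```cpp
// Minimal sketch only: the real Cache::Insert also takes a deleter/helper,
// a handle out-parameter, etc. The point is the extra priority level and
// the per-entry-type choice of priority.
#include <cstddef>

namespace sketch {

// Existing HIGH/LOW levels plus the proposed bottom level for blobs.
enum class Priority { HIGH, LOW, BOTTOM };

class Cache {
 public:
  virtual ~Cache() = default;
  virtual bool Insert(const char* key, void* value, size_t charge,
                      Priority priority) = 0;
};

// Intended mapping:
//   index/filter (meta) blocks -> Priority::HIGH
//   data blocks                -> Priority::LOW
//   blob values                -> Priority::BOTTOM
inline bool InsertBlob(Cache& cache, const char* key, void* blob,
                       size_t charge) {
  return cache.Insert(key, blob, charge, Priority::BOTTOM);
}

}  // namespace sketch
```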
This task is a part of https://github.com/facebook/rocksdb/issues/10156
The previous PR (#10309) was reverted by #10434 because it caused regression tests to fail (degraded performance). We have since reproduced the regression and fixed the underlying bug by comparing the performance of the three versions (#10451); this PR contains the latest implementation.
Performance test results for the three binaries were provided by @mdcallag; more details can be found in [bug, revert, fix].
@pdillinger @ltamasi @gitbw95 @riversand963
I don't think we should introduce small passive performance regressions (affecting performance when new feature is not enabled) for features that haven't been proven necessary.
One possibility that I'm not in love with but might be a decent compromise would be to add two bool template parameters to LRUCache for whether certain priority pools exist. This way the appropriately optimized implementation can be used when those pools don't (need to) exist. This code is hot enough that I think we can accept code size increase.
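Roughly what I have in mind, as a purely hypothetical sketch (none of this is code from the PR): make pool existence a compile-time property, so the instantiation used when the pools are disabled carries none of the extra bookkeeping.

```cpp
// Hypothetical illustration of the template-parameter idea: the pool
// accounting compiles away entirely in the instantiation that has no pools,
// keeping the hot path (e.g. LRU_Remove) as lean as it is today.
#include <cstddef>

template <bool kHasHighPriPool, bool kHasLowPriPool>
class LRUShardSketch {
 public:
  void Remove(size_t charge, bool in_high_pri_pool, bool in_low_pri_pool) {
    total_charge_ -= charge;
    if constexpr (kHasHighPriPool) {
      if (in_high_pri_pool) high_pri_pool_charge_ -= charge;
    }
    if constexpr (kHasLowPriPool) {
      if (in_low_pri_pool) low_pri_pool_charge_ -= charge;
    }
  }

 private:
  size_t total_charge_ = 0;
  size_t high_pri_pool_charge_ = 0;
  size_t low_pri_pool_charge_ = 0;
};

// The cache constructor would then pick the right instantiation once, e.g.
// LRUShardSketch<false, false> when neither pool is configured.
```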
I'd say let's first establish whether there is actually a performance impact for the common case with the updated code. This is still in progress. (The earlier patch did have a bug that we have since fixed.) Personally, the change doesn't strike me as something that would introduce a significant overhead; then again, I might turn out to be wrong on this one, in which case we can obviously reconsider.
This is among the hottest of hot code, and this change introduces strictly more tracking to be done regardless of enabling the feature. Just look at LRU_Remove code, which alone is 1% of total CPU (nearly 2% of RocksDB CPU) of a large workload looking for CPU improvements and 0.3% of a very large workload. (All of LRUCache is roughly 9% and 2.5% of those workloads.)
I know it's hot - what I'm saying is that I don't expect the cost to be noticeable if the feature is not enabled (low_pri_pool_ratio==0, which is the default), especially when FDO is used.
I will share more benchmark results later today.
The low pri pool was added as "mid-point" insertion. The purpose is to make sure occasional full scans (logical backups, shard movements, administrative scans, etc) that touch a large range of otherwise cold blocks would not thrash all the hot blocks, and they will die in low pri pool and blocks in high pri pool will remain in cache. With bottom pri pool, does it mean those scans will thrash both low pri pool and bottom pool which contains blobs?
Our goal is to prioritize SST data blocks over blobs, since with BlobDB data blocks sort of function as an index. So when those scans happen, we would prefer to keep data blocks in the cache at the expense of blobs.
My test shows an unexplained small improvement (~2% ops/sec on a benchmark with 50% CPU in LRUCache) with the feature present but disabled, and ~9% overhead with the feature enabled. Assuming Mark's tests come out OK, that satisfies me on the performance front.
The answer to my question is yes, or no?
Sorry I'm not sure I understand your question. If you have all three pools enabled (i.e. high ratio > 0, low ratio > 0), metablocks will go in the high-pri pool, data blocks will go in the low-pri pool, and blobs will go in the bottom-pri pool. So if we have a full scan, blobs will see more thrashing than data blocks, which in turn will see more thrashing than metablocks (which is I think what we want).
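For a concrete sense of the split (illustrative numbers only): with a 1 GB cache, high_pri_pool_ratio=0.1 and low_pri_pool_ratio=0.4 would reserve roughly 100 MB for the high-pri pool (metablocks) and 400 MB for the low-pri pool (data blocks), leaving the remaining ~500 MB as the bottom-pri region where blobs live and where scan-driven churn hits first.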
I have results from 3 benchmarks. I will summarize them here, then add 2 more comments with more detail per benchmark.
Two binaries were compared; I assume these are close in the git log:
- fix - db_bench with the proposed fix, git hash e69183be42, this is the PR
- revert - db_bench with the original change reverted, git hash d976f6897, this is RocksDB without the PR
The test server has 256G RAM, 40 cores, 80 HW threads and hyperthreads are enabled.
The benchmarks are:
- viewstate - from tools/regression_test.sh, 335G database, uses buffered IO for user reads, 16 client threads
- my benchmark.sh, by RX - database is cached by RocksDB
- my benchmark.sh, by OS - database is cached by the OS, 1G RocksDB block cache
- all use leveled compaction, BlobDB was not evaluated here
For my benchmark.sh
- I mean using the wrapper scripts I normally use. The database has 40M KV pairs, 20-byte keys, and 400-byte values. Compression was not used. I assume, but did not confirm, that compression is used for viewstate
- tests were repeated for 4 and 16 client threads
Everything was run 3 times and I focus on the median values for QPS and CPU/query
I expect by RX at 4 threads to be the most likely to show a regression. The summary is ...
For by OS:
- qps for fix is 0.57% smaller at 4 threads, 0.26% smaller at 16 threads
- cpuM/q (cpu/query) for fix is 0.42% larger at 4 threads, 0.18% larger at 16 threads
For by RX:
- qps for fix is ~3.5% larger at 4 threads, ~2.0% larger at 16 threads
- cpuM/q (cpu/query) for fix is ~3.4% smaller at 4 threads, ~1.9% smaller at 16 threads
For viewstate:
- qps for fix is ~0.29% smaller
- cpuM/q (cpu/query) for fix is ~0.21% larger
More detail for by RX and by OS
Using tools/benchmark.sh with 40M KV pairs, 20-byte keys, 400-byte values and no compression. This was run for 4 and 16 client threads. Finally, each benchmark was repeated 3 times and I focus on the median for QPS and CPU overhead.
There were 2 configs:
- by OS - 1G RocksDB block cache, all data in OS page cache
- by RX - data fits in RocksDB block cache
For "by OS":
- qps for fix is 0.57% smaller at 4 threads, 0.26% smaller at 16 threads
- cpuM/q (cpu/query) for fix is 0.42% larger at 4 threads, 0.18% larger at 16 threads
For "by RX":
- qps for fix is ~3.5% larger at 4 threads, ~2.0% larger at 16 threads
- cpuM/q (cpu/query) for fix is ~3.4% smaller at 4 threads, ~1.9% smaller at 16 threads
Legend:
* qps - average throughput
* user, sys - user CPU and system CPU seconds per /bin/time; wall-clock time was ~600 seconds
* cpuM/q - (cpu/queries)*1M where cpu is user + system time from db_bench per /bin/time and queries is total number of queries
* cached? - "by RX" if cached by RocksDB, "by OS" otherwise
* threads - number of client threads
* binary - "revert" if db_bench with change reverted, "fix" if with change fixed
--- by OS results
- 4 threads
qps user sys cpuM/q threads binary
283450 1609 788 14.096 4 revert
283616 1608 789 14.088 4 fix
-
284900 1601 796 14.023 4 revert
282878 1609 788 14.125 4 fix
-
284490 1605 792 14.044 4 revert
281000 1614 783 14.218 4 fix
-
* median QPS is 284490 for revert, 282878 for fix
* relative QPS (median fix / median revert) = 0.9943 -> QPS for fix is 0.57% smaller
* median cpuM/q is 14.044 for revert, 14.125 for fix
* relative cpuM/q (median fix / median revert) = 1.0058 -> cpuM/q for fix is 0.42% larger
- 16 threads
qps user sys cpuM/q threads binary
1052129 6318 3212 15.095 16 revert
1048474 6321 3203 15.140 16 fix
-
1051213 6334 3198 15.113 16 revert
1052706 6319 3213 15.090 16 fix
-
1049348 6318 3212 15.135 16 revert
1048458 6346 3189 15.157 16 fix
-
* median QPS is 1051213 for revert, 1048474 for fix
* relative QPS (median fix / median revert) = 0.9974 -> QPS for fix is 0.26% smaller
* median cpuM/q is 15.113 for revert, 15.140 for fix
* relative cpuM/q (median fix / median revert) = 1.0018 -> cpuM/q for fix is 0.18% larger
--- by RX results
- 4 threads
qps user sys cpuM/q threads binary
458115 2331 58 8.691 4 revert
460179 2333 56 8.651 4 fix
-
456688 2326 61 8.711 4 revert
474120 2334 56 8.400 4 fix
-
466457 2333 56 8.538 4 revert
479702 2331 58 8.303 4 fix
-
* median QPS is 458115 for revert, 474120 for fix
* relative QPS (median fix / median revert) = 1.0349 -> QPS for fix is 3.49% larger
* median cpuM/q is 8.691 for revert, 8.400 for fix
* relative cpuM/q (median fix / median revert) = 0.9665 -> cpuM/q for fix is 3.35% smaller
- 16 threads
qps user sys cpuM/q threads binary
1615606 8797 586 9.680 16 revert
1659593 8789 598 9.427 16 fix
-
1602179 8817 577 9.772 16 revert
1645338 8843 561 9.526 16 fix
-
1620401 8805 586 9.659 16 revert
1647635 8804 583 9.495 16 fix
-
* median QPS is 1615606 for revert, 1647635 for fix
* relative QPS (median fix / median revert) = 1.0198 -> QPS for fix is 1.98% larger
* median cpuM/q is 9.680 for revert, 9.495 for fix
* relative cpuM/q (median fix / median revert) = 0.9809 -> cpuM/q for fix is 1.91% smaller
For viewstate
This has results from viewstate for readrandom, seekrandomwhilewriting and readwhilewriting.
I didn't run seekrandom because it tends to have more variance, which I have yet to explain.
Tests are run with 16 client threads on a host with 40 cores, 80 HW threads, HT is enabled and 256G RAM.
For the viewstate test there are 128 databases (--num_multi_db=128) and the total db size is ~335G. O_DIRECT is used for compaction, buffered IO is used for user reads, and I assume the database files are compressed.
Notes:
* qps for fix is ~0.29% smaller
* cpuM/q (cpu/query) for fix is ~0.21% larger
* storage reads/query are similar
Legend:
* qps - average throughput
* p50, p99 - response time in microseconds
* p99.9, p99.99 - response time in seconds
* real, user, sys - seconds of wall-clock, user CPU and system CPU time per /bin/time
* cpuM/q - (cpu/queries)*1M where cpu is user + system time from db_bench per /bin/time and queries is total number of queries
Result order is: revert, fix
readrandom
qps p50 p99 p99.9 p99.99 real user sys cpuM/q binary
115835 138140 376360 0.548 0.721 1801 17838 2868 99.30 revert
115595 138310 376510 0.549 0.721 1801 17823 2866 99.42 fix
-
115683 138270 376470 0.549 0.730 1801 17850 2864 99.47 revert
115485 138480 376570 0.550 0.728 1801 17839 2863 99.58 fix
-
115818 138180 376400 0.549 0.722 1801 17851 2867 99.37 revert
115444 138360 376680 0.550 0.729 1801 17843 2869 99.66 fix
-
* median QPS is 115818 for revert, 115485 for fix
* ratio of median QPS (fix/revert) = 0.9971 -> QPS for fix is ~0.29% less
* median cpuM/q is 99.37 for revert, 99.58 for fix
* ratio of median cpuM/q (fix / revert) is 1.0021 -> fix uses ~0.21% more CPU/query
seekrandomwhilewriting
qps p50 p99 p99.9 p99.99 real user sys binary
65909 230490 571870 0.812 1.028 1803 18357 2997 revert
65879 230640 571920 0.813 1.066 1803 18348 2995 fix
-
65864 230650 572060 0.814 1.063 1804 18352 2999 revert
65851 230680 572020 0.814 1.042 1803 18351 2996 fix
-
65895 230590 571960 0.813 1.064 1803 18349 2996 revert
65743 230960 572080 0.814 1.060 1803 18370 2990 fix
-
* median QPS is 65895 for revert, 65851 for fix
* ratio of median QPS (fix/revert) = 0.9993 -> QPS for fix is ~0.07% less
readwhilewriting
qps p50 p99 p99.9 p99.99 real user sys binary
113692 139610 377860 0.558 0.880 1853 17557 2994 revert
113565 139770 377810 0.558 0.869 1853 17559 3005 fix
-
113589 139700 377940 0.559 0.864 1852 17551 2990 revert
113590 139680 377870 0.558 0.857 1854 17574 2994 fix
-
113773 139600 377850 0.558 0.864 1853 17558 2995 revert
113507 139770 377890 0.558 0.865 1853 17580 2991 fix
-
* median QPS is 113692 for revert, 113565 for fix
* ratio of median QPS (fix/revert) = 0.9989 -> QPS for fix is 0.11% less
The first time a data block is inserted, it goes to low-pri pool, but if it gets hit again, it is promoted to the high pri pool. On the other hand, what is evicted from the high pri pool will be inserted into the low-pri pool. Metablock in high pri pool is just a paranoid optimization, rather than the main purpose of the two-pool approach.
Metablock in high pri pool is just a paranoid optimization, rather than the main purpose of the two-pool approach.
One data point regarding that: technically the root cause of the earlier regression was that when there was a low-pri pool configured but no high-pri one, metablocks ended up at the bottom (below data blocks) and got evicted first. This suggests to me that where we insert them in the LRU list does make a difference.
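For readers less familiar with the mid-point insertion scheme, here is a pseudocode-level sketch of the behavior described above (my reading of it, with made-up names, not the actual LRUCacheShard code): first insertion lands at the midpoint, a subsequent hit promotes the entry to the high-pri region when it is re-inserted, and entries aging out of the high-pri region are demoted rather than evicted outright.

```cpp
// Pseudocode-level sketch of midpoint insertion with promotion on re-hit.
// Names and structure are illustrative; the real LRUCacheShard keeps a
// single doubly-linked list with pool boundary pointers instead.
#include <cstddef>
#include <list>
#include <string>
#include <utility>

struct SketchEntry {
  std::string key;
  bool has_hit = false;  // set on Lookup; consulted on the next re-insert
};

class TwoPoolListSketch {
 public:
  explicit TwoPoolListSketch(size_t high_pri_budget)
      : high_pri_budget_(high_pri_budget) {}

  void InsertOrReinsert(SketchEntry e) {
    if (e.has_hit) {
      // Promoted: the entry has been hit at least once since insertion.
      high_.push_front(std::move(e));
      MaybeOverflowHighPri();
    } else {
      // First use: midpoint insertion at the head of the low-pri region.
      low_.push_front(std::move(e));
    }
  }

 private:
  void MaybeOverflowHighPri() {
    // Entries aging out of the high-pri region are demoted to the low-pri
    // region rather than evicted immediately.
    while (high_.size() > high_pri_budget_) {
      low_.push_front(std::move(high_.back()));
      high_.pop_back();
    }
  }

  size_t high_pri_budget_;
  std::list<SketchEntry> high_;  // hot region (head = most recently used)
  std::list<SketchEntry> low_;   // cold region; eviction happens at the back
};
```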
@siying @mdcallag Please let us know if you have any concerns about landing this considering the performance numbers above. Thanks!
Did a test with BlobDB to quantify the effects of the bottommost cache priority. After populating a smallish (fits in the OS page cache) DB with 1KB-sized values using fillrandom, I compared readrandom performance (with 16 threads) using a 1GB shared block/blob cache with two settings. Baseline was high_pri_pool_ratio=0.1 (enough to hold all the metablocks) and low_pri_pool_ratio=0.0 (default). The other configuration had the new feature enabled: high_pri_pool_ratio=0.1 (same as before) and low_pri_pool_ratio=0.4.
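For reference, a hedged sketch of roughly how such a shared block/blob cache would be wired up (field names to the best of my understanding; low_pri_pool_ratio is the knob added by this PR, and the exact db_bench invocation is not reproduced here):

```cpp
// Sketch of the test configuration described above: a single 1 GB LRU cache
// shared by the block cache and the blob cache, with 10% reserved for the
// high-pri pool and (in the "feature enabled" run) 40% for the low-pri pool.
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options SharedBlockAndBlobCacheOptions() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1ull << 30;      // 1 GB shared cache
  cache_opts.high_pri_pool_ratio = 0.1;  // metablocks
  cache_opts.low_pri_pool_ratio = 0.4;   // data blocks; 0.0 in the baseline
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::Options options;
  options.enable_blob_files = true;  // integrated BlobDB
  options.blob_cache = cache;        // blobs land in the bottom-pri region

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;    // same cache object for SST blocks
  table_opts.cache_index_and_filter_blocks = true;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return options;
}
```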
I repeated the tests three times. Here are the data block cache miss counters for the baseline (note: there were no cache misses for index blocks after the initial load with either configuration):
rocksdb.block.cache.data.miss COUNT : 371737
rocksdb.block.cache.data.miss COUNT : 371568
rocksdb.block.cache.data.miss COUNT : 372248
With the low-priority pool enabled:
rocksdb.block.cache.data.miss COUNT : 99611
rocksdb.block.cache.data.miss COUNT : 99625
rocksdb.block.cache.data.miss COUNT : 99708
So data blocks saw much less churn with the low-priority pool enabled, which is what we would want since we have to go through them to get to the blobs. This also translated to higher ops/sec: baseline was 617698/633135/635381 for the three runs vs 620729/636272/643619 with the feature enabled.
Good results! I assume total read I/O bandwidth also improved. Can you post the numbers here for the record?
Some more stats about overall cache misses as discussed with @siying :
Baseline:
rocksdb.block.cache.miss COUNT : 371740
rocksdb.blobdb.cache.miss COUNT : 111060940
rocksdb.block.cache.miss COUNT : 371571
rocksdb.blobdb.cache.miss COUNT : 111062668
rocksdb.block.cache.miss COUNT : 372251
rocksdb.blobdb.cache.miss COUNT : 111062404
With the feature enabled:
rocksdb.block.cache.miss COUNT : 99614
rocksdb.blobdb.cache.miss COUNT : 111078933
rocksdb.block.cache.miss COUNT : 99628
rocksdb.blobdb.cache.miss COUNT : 111070234
rocksdb.block.cache.miss COUNT : 99711
rocksdb.blobdb.cache.miss COUNT : 111069999
So there are fewer misses overall and also less I/O: a ~270,000 reduction in SST block misses in exchange for a ~10,000-20,000 increase in blob misses (and the 1-KB blobs are smaller than SST blocks, so each blob miss costs less I/O).
@ltamasi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@gangliao has updated the pull request. You must reimport the pull request before landing.
Hi @gangliao!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!