Skip swaths of range tombstone covered keys in merging iterator (2022 edition)
Summary: With the invariant that a key in level L (consider the memtable as the first level, and each immutable memtable and L0 file as a separate level) has a larger sequence number than all keys in any level > L, a range tombstone `[start, end)` from level L covers all keys in its range in any level > L. This property motivates the following iterator optimizations:
- In `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do the initial Seek on `end` instead of `target` to skip some range tombstone covered keys.
- In `Next()`/`Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can Seek to `end` for all levels > L.
This PR implements the above optimizations in `MergingIterator`. Since all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to #7317, but this PR leaves the `InternalIterator` interface unchanged. Credit: the cascading seek optimization and the sentinel key (discussed below) are inspired by Pebble and suggested by @ajkr in #7317.
The two optimizations are mostly implemented in `SeekImpl()`/`SeekForPrevImpl()` and `IsNextDeleted()`/`IsPrevDeleted()` in `merging_iterator.cc`. See the comments on each method for more detail.
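To illustrate the cascading seek, here is a minimal, self-contained sketch. `Tombstone`, `ChildIter`, and `CascadingSeek` are hypothetical stand-ins for the per-level child iterators, not the actual RocksDB classes or methods, and sequence numbers, tombstone truncation at file boundaries, and iterator bounds are ignored for brevity:

```cpp
// Sketch only: simplified model of the cascading-seek idea.
#include <optional>
#include <string>
#include <vector>

struct Tombstone {
  std::string start, end;  // covers user keys in [start, end)
  bool Covers(const std::string& key) const {
    return start <= key && key < end;
  }
};

struct ChildIter {
  // Hypothetical per-level child: a sorted list of point keys plus the
  // range tombstone (if any) active around the current position.
  std::vector<std::string> keys;
  std::optional<Tombstone> tombstone;
  size_t pos = 0;

  void Seek(const std::string& target) {
    pos = 0;
    while (pos < keys.size() && keys[pos] < target) ++pos;
  }
};

// Seek every level to `target`, but once some level L has a range tombstone
// [start, end) covering `target`, deeper (older) levels are seeked directly
// to `end`: every key of theirs in [target, end) is deleted by that
// tombstone anyway.
void CascadingSeek(std::vector<ChildIter>& levels, std::string target) {
  for (auto& level : levels) {  // levels[0] is newest
    level.Seek(target);
    if (level.tombstone && level.tombstone->Covers(target)) {
      target = level.tombstone->end;
    }
  }
}
```

The key point is that the seek target only ever moves forward, so each older level starts its seek at or beyond the end of every covering tombstone found in newer levels so far.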
One major change, the sentinel range tombstone key, is made to `LevelIterator`. Before this PR, once all point keys in an SST file had been iterated through in `MergingIterator`, the level iterator would advance to the next SST file in its level. When an SST file has a range tombstone that covers keys beyond the file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that a file boundary is a sentinel key when that boundary is defined by some range tombstone. A `LevelIterator` now advances to the next SST file only after the sentinel key has been processed.
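A minimal sketch of the sentinel-key idea, assuming a simplified level iterator over files that each contain at least one point key; `FileMeta` and `SimpleLevelIterator` are hypothetical names for illustration, not RocksDB classes:

```cpp
// Sketch only: a level iterator that lingers on a sentinel key before
// advancing past a file whose boundary comes from a range tombstone.
#include <string>
#include <vector>

struct FileMeta {
  std::vector<std::string> point_keys;  // sorted point keys in the file
  std::string largest_boundary;         // file's largest key
  bool boundary_from_range_tombstone;   // boundary defined by a tombstone?
};

class SimpleLevelIterator {
 public:
  explicit SimpleLevelIterator(std::vector<FileMeta> files)
      : files_(std::move(files)) {}

  bool Valid() const { return file_idx_ < files_.size(); }

  std::string key() const {
    const FileMeta& f = files_[file_idx_];
    // While positioned on the sentinel, expose the file boundary so the
    // merging iterator keeps this level (and its tombstone) in play until
    // covered keys from other levels have been skipped.
    return on_sentinel_ ? f.largest_boundary : f.point_keys[key_idx_];
  }

  void Next() {
    const FileMeta& f = files_[file_idx_];
    if (!on_sentinel_ && key_idx_ + 1 < f.point_keys.size()) {
      ++key_idx_;  // still inside this file's point keys
      return;
    }
    if (!on_sentinel_ && f.boundary_from_range_tombstone) {
      // Do not advance to the next file yet: pretend the boundary is a key.
      on_sentinel_ = true;
      return;
    }
    // Sentinel (or plain last key) consumed; now move to the next file.
    ++file_idx_;
    key_idx_ = 0;
    on_sentinel_ = false;
  }

 private:
  std::vector<FileMeta> files_;
  size_t file_idx_ = 0;
  size_t key_idx_ = 0;
  bool on_sentinel_ = false;
};
```

Holding the iterator on the sentinel keeps this level, and therefore its tombstone, visible to the merging heap until the other levels have been seeked past the tombstone's end.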
Test plan:
- Added many unit tests in db_range_del_test
- Stress test: ./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 --delpercent=10 --iterpercent=44 --delrangepercent=2
- Performance benchmark: I used a setup similar to the one in the blog post that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width.
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
# Long range scan (1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
Average over 5 runs (some slower tests had fewer runs):
For the first column (tombstone width), 0 means no range tombstone, 100-10000 is the width of each of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (with width 1000). The single-tombstone case checks for regression when there are very few range tombstones in the DB, since the no-tombstone case likely takes a different code path than the cases with range tombstones.
- Scan entire DB
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
|---|---|---|---|
| 0 | 2522114 (± 54748) | 2373930 (± 57270) | -5.88% |
| 100 | 1831801 (± 35137) | 1853007 (± 23386) | +1.16% |
| 1000 | 431380 (± 5523) | 1144335 (± 18289) | +165.27% |
| 10000 | 23171 (± 430) | 233767 (± 9013) | +908.88% |
| 1 | 2448520 (± 41681) | 2295892 (± 54856) | -6.23% |
- Short range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
|---|---|---|---|
| 0 | 36196 (± 848) | 35489 (± 603) | -1.95% |
| 100 | 28295 (± 534) | 29804 (± 567) | +5.33% |
| 1000 | 8252 (± 150) | 24998 (± 222) | +202.93% |
| 10000 | 1289 (± 5) | 30722 (± 287) | +2283.4% |
| 1 | 34786 (± 476) | 33114 (± 395) | -4.81% |
- Long range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
|---|---|---|---|
| 0 | 2282 (± 38) | 2266 (± 51) | -0.7% |
| 100 | 1631 (± 64) | 1668 (± 41) | +2.27% |
| 1000 | 382 (± 16) | 1100 (± 18) | +187.96% |
| 10000 | 24 (± 0) | 658 (± 21) | +2641.67% |
| 1 | 2113 (± 49) | 2026 (± 52) | -4.12% |
- TODO: this change touches critical iterator code, so I want to think of more stress tests to run, and to consider iterator lower/upper bounds, prefix scan, and user-defined timestamps.