Space-amp with universal is unpredictable because old SSTs are not deleted when there are long-running compactions
When running tools/benchmark.sh for universal compaction with a database much larger than memory, I get unexpected out-of-disk-space errors. The interesting details are:
- compaction IO stats show the database is ~1TB (per the Sum line)
- Options.sst_file_manager.rate_bytes_per_sec: 0 (this is db_bench, so file deletes are not rate limited)
- Options.compaction_options_universal.max_size_amplification_percent: 100
The problem is that SST files are not deleted because of the min_pending_output check in GetObsoleteFiles, but I will save the details for the next comment on this issue.
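The keep condition itself is simple. As a hedged sketch (a Python model of the check described here, not the RocksDB source), a table file survives purging when it is still referenced by a live version or when its file number is at least min_pending_output:

```python
# Model (not RocksDB source) of the obsolete-file keep check in
# DBImpl::PurgeObsoleteFiles: a table file is kept if it is referenced by a
# live version, OR if its number is >= min_pending_output, the file number
# remembered by the earliest still-running flush/compaction job.
def keep_table_file(number, sst_live_set, min_pending_output):
    return number in sst_live_set or number >= min_pending_output

# One long-running compaction pins min_pending_output, so every SST created
# after it started is kept even when it is already obsolete:
assert keep_table_file(67046, set(), 67303) is False   # deletable
assert keep_table_file(479984, set(), 67303) is True   # pinned, not deleted
```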
Here are compaction IO stats where the Sum line shows the current database is ~1TB, while the database directory holds up to ~7TB of SST files (the size grows and shrinks).
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 9/6 430.21 MB 0.0 680.1 0.0 680.1 1006.9 326.9 0.0 3.1 156.7 232.1 4443.50 3165.16 32785 0.136 1697M 53K 0.0 0.0
L8 0/0 0.00 KB 0.0 0.1 0.1 0.0 0.1 0.1 0.0 1.0 642.3 641.8 0.24 0.46 1 0.238 373K 19 0.0 0.0
L9 0/0 0.00 KB 0.0 0.1 0.1 0.0 0.1 0.1 0.0 1.0 842.3 842.0 0.17 0.50 2 0.086 354K 2 0.0 0.0
L10 0/0 0.00 KB 0.0 0.2 0.2 0.0 0.2 0.2 0.0 1.0 902.7 902.4 0.19 0.47 3 0.063 418K 3 0.0 0.0
L11 0/0 0.00 KB 0.0 0.3 0.3 0.0 0.3 0.3 0.0 1.0 816.2 815.7 0.41 1.22 5 0.082 812K 8 0.0 0.0
L12 0/0 0.00 KB 0.0 0.8 0.5 0.3 0.8 0.5 0.0 1.5 574.3 574.1 1.46 2.83 14 0.104 2038K 14 0.0 0.0
L13 0/0 0.00 KB 0.0 2.0 1.4 0.6 2.0 1.4 0.0 1.4 875.6 875.2 2.30 6.71 22 0.105 4912K 106 0.0 0.0
L14 0/0 0.00 KB 0.0 3.6 2.1 1.5 3.6 2.1 0.0 1.7 892.7 892.4 4.09 11.54 30 0.136 8904K 229 0.0 0.0
L15 0/0 0.00 KB 0.0 5.0 2.7 2.3 5.0 2.7 0.0 1.9 877.2 876.9 5.84 17.00 39 0.150 12M 372 0.0 0.0
L16 0/0 0.00 KB 0.0 4.3 3.0 1.3 4.3 3.0 0.0 1.4 803.9 803.6 5.49 14.79 44 0.125 10M 308 0.0 0.0
L17 0/0 0.00 KB 0.0 6.6 4.3 2.3 6.6 4.3 0.0 1.5 829.8 829.5 8.19 21.97 66 0.124 16M 357 0.0 0.0
L18 0/0 0.00 KB 0.0 10.5 6.1 4.4 10.5 6.1 0.0 1.7 799.4 799.2 13.43 35.48 96 0.140 26M 536 0.0 0.0
L19 0/0 0.00 KB 0.0 11.5 7.9 3.6 11.5 7.8 0.0 1.5 808.6 808.4 14.51 37.61 112 0.130 28M 659 0.0 0.0
L20 0/0 0.00 KB 0.0 27.6 13.0 14.6 27.6 13.0 0.0 2.1 881.4 881.2 32.02 88.85 144 0.222 68M 3504 0.0 0.0
L21 0/0 0.00 KB 0.0 35.7 18.3 17.4 35.7 18.3 0.0 1.9 853.0 852.7 42.86 117.55 193 0.222 89M 5649 0.0 0.0
L22 0/0 0.00 KB 0.0 49.6 26.7 22.9 49.6 26.6 0.0 1.9 870.2 869.9 58.34 163.70 258 0.226 123M 10K 0.0 0.0
L23 0/0 0.00 KB 0.0 72.6 34.2 38.4 72.6 34.2 0.0 2.1 871.1 870.8 85.38 242.59 320 0.267 181M 12K 0.0 0.0
L24 0/0 0.00 KB 0.0 78.2 44.2 34.0 78.2 44.2 0.0 1.8 851.4 851.2 94.06 268.18 321 0.293 195M 16K 0.0 0.0
L25 0/0 0.00 KB 0.0 139.2 59.2 80.0 139.1 59.1 0.0 2.4 886.7 886.5 160.72 460.42 375 0.429 347M 29K 0.0 0.0
L26 42/42 642.12 MB 0.0 176.3 78.3 98.0 176.3 78.2 0.0 2.3 872.6 872.2 206.95 580.45 362 0.572 440M 107K 0.0 0.0
L27 69/69 1.07 GB 0.0 251.9 96.9 155.0 251.7 96.7 0.0 2.6 873.2 872.4 295.44 825.54 379 0.780 629M 430K 0.0 0.0
L28 187/187 2.91 GB 0.0 203.2 102.1 101.1 203.2 102.0 0.0 2.0 813.0 812.6 256.00 677.66 399 0.642 507M 128K 0.0 0.0
L29 0/0 0.00 KB 0.0 304.8 116.7 188.1 304.5 116.4 0.0 2.6 808.3 807.7 386.07 1001.15 343 1.126 761M 447K 0.0 0.0
L30 0/0 0.00 KB 0.0 306.8 121.7 185.1 306.6 121.5 0.0 2.5 887.8 887.4 353.81 993.57 280 1.264 766M 238K 0.0 0.0
L31 0/0 0.00 KB 0.0 455.3 131.8 323.5 454.8 131.3 0.0 3.5 850.7 849.8 548.04 1477.33 219 2.502 1138M 965K 0.0 0.0
L32 0/0 0.00 KB 0.0 313.6 114.8 198.8 313.2 114.5 0.0 2.7 787.2 786.4 407.87 1046.96 108 3.777 783M 667K 0.0 0.0
L33 1200/1200 18.83 GB 0.0 270.3 99.0 171.2 269.9 98.7 0.0 2.7 829.0 827.8 333.84 872.59 48 6.955 675M 791K 0.0 0.0
L34 2616/2616 41.07 GB 0.0 373.2 150.1 223.1 372.5 149.4 0.0 2.5 994.2 992.2 384.38 1177.81 27 14.236 933M 1689K 0.0 0.0
L35 15301/15301 240.39 GB 0.0 1716.1 259.1 1457.0 1697.4 240.4 0.0 6.6 852.1 842.8 2062.27 6321.49 22 93.740 4539M 46M 0.0 0.0
L36 0/0 0.00 KB 0.0 15.2 0.0 15.2 27.6 12.4 0.0 2103.8 631.0 1147.4 24.63 86.06 1 24.634 69M 585 0.0 0.0
L39 55055/0 867.37 GB 0.0 890.9 27.6 863.3 867.4 4.1 238.7 31.4 335.2 326.3 2721.62 9388.96 1 2721.624 4000M 32K 0.0 0.0
Sum 74479/19421 1.15 TB 0.0 6405.4 1522.5 4882.9 6699.7 1816.8 238.7 20.5 506.3 529.6 12954.12 29106.61 37019 0.350 18G 52M 0.0 0.0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Low 0/0 0.00 KB 0.0 6405.4 1522.5 4882.9 6372.5 1489.6 0.0 0.0 604.8 601.7 10845.32 27836.22 15617 0.694 18G 52M 0.0 0.0
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 327.2 327.2 0.0 0.0 0.0 158.9 2108.81 1270.39 21402 0.099 0 0 0.0 0.0
Uptime(secs): 2973.3 total, 20.0 interval
Flush(GB): cumulative 327.177, interval 3.014
Cumulative compaction: 6699.69 GB write, 2307.33 MB/s write, 6405.42 GB read, 2205.98 MB/s read, 12954.1 seconds
Interval compaction: 42.75 GB write, 2184.91 MB/s write, 39.79 GB read, 2033.79 MB/s read, 77.0 seconds
Stalls(count): 708 level0_slowdown, 708 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 268 memtable_slowdown, interval 16 total count
Block cache LRUCache@0x7f3618e28290#1898157 capacity: 210.00 GB collections: 2 last_copies: 3 last_secs: 7.19322 secs_since: 2364
Block cache entry stats(count,size,portion): DataBlock(12527444,96.49 GB,45.9464%) FilterBlock(70176,4.17 GB,1.98532%) IndexBlock(134124,7.23 GB,3.44461%) Misc(16,121.17 KB,5.50279e-05%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 2973.3 total, 20.0 interval
Cumulative writes: 816M writes, 816M keys, 59M commit groups, 13.8 writes per commit group, ingest: 323.00 GB, 111.24 MB/s
Cumulative WAL: 816M writes, 0 syncs, 816296033.00 writes per sync, written: 323.00 GB, 111.24 MB/s
Cumulative stall: 00:01:59.793 H:M:S, 4.0 percent
num-running-compactions: 6
num-running-flushes: 2
Partial output from a script that monitors what is in the database directory. The script does this every 3 seconds:
b_gb=$( ls -l $DB_DIR 2> /dev/null | grep blob | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
s_gb=$( ls -l $DB_DIR 2> /dev/null | grep sst | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
l_gb=$( ls -l $WAL_DIR 2> /dev/null | grep log | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
a_gb=$( ls -l $DB_DIR 2> /dev/null | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
du1_gb=$( du -hs -BG $DB_DIR 2> /dev/null | awk '{ print $1 }' )
du2_gb=$( du -hs -BG --apparent-size $DB_DIR 2> /dev/null | awk '{ print $1 }' )
dbbpid=$( pidof db_bench )
nf1=$( ls $DB_DIR | wc -l )
nf1S=$( ls $DB_DIR | grep sst | wc -l )
nf1L=$( ls $DB_DIR | grep log | wc -l )
nf1T=$( ls $DB_DIR | grep trash | wc -l )
nf2=$( lsof -p $dbbpid | grep REG | wc -l )
ts=$( date +%H%M%S )
echo -e "${a_gb}\t${s_gb}\t${l_gb}\t${b_gb}\t${du1_gb}\t${du2_gb}\t${nf1}\t${nf1S}\t${nf1L}\t${nf1T}\t${nf2}\t${ts}"
The output below shows the database directory reaches ~7TB from SST files.
Legend:
* tot-sz - total size of database directory (GB)
* sst-sz - size of SST files (GB), per ls
* log-sz - size of *.log files (GB), per ls
* blob-sz - size of blob files (GB), per ls
* du1-sz - size of database directory (GB) per du
* du2-sz - size of database directory (GB) per du --apparent-size
* num-f - number of files in database directory
* num-s - number of SSTs in database directory
* num-l - number of *.log files in database directory
* num-t - number of trash files in database directory
* lsof - number of open REG files for db_bench, per lsof
* time - timestamp as HHMMSS
tot-sz sst-sz log-sz blob-sz du1-sz du2-sz num-f num-s num-l num-t lsof time
6452.5 6441.7 0.1 0.0 6459G 6460G 399818 399870 7 0 353694 162127
6500.7 6490.3 0.0 0.0 6507G 6507G 402713 402750 7 0 354660 162151
6550.6 6538.3 0.1 0.0 6557G 6557G 405819 405866 5 0 355715 162215
6599.0 6587.0 0.0 0.0 6606G 6606G 408818 408833 6 0 356820 162241
6654.3 6639.3 0.0 0.0 6661G 6662G 412352 412408 5 0 357888 162306
6710.1 6697.6 0.0 0.0 6717G 6718G 415784 415804 5 0 360077 162329
6763.6 6749.5 0.1 0.0 6771G 6772G 419193 419246 4 0 360980 162355
6824.3 6809.3 0.1 0.0 6830G 6831G 422882 422936 4 0 363957 162420
6875.0 6862.0 0.1 0.0 6883G 6883G 426225 426244 7 0 364862 162445
6937.2 6921.1 0.0 0.0 6945G 6945G 429925 429988 4 0 367690 162512
6995.7 6982.4 0.0 0.0 7004G 7005G 433664 433701 4 0 369381 162538
7050.3 7039.0 0.0 0.0 7057G 7057G 436879 436923 5 0 385782 162604
7093.1 7081.1 0.0 0.0 7100G 7100G 439690 439744 4 0 387251 162632
7144.7 7132.8 0.0 0.0 7150G 7149G 442781 442797 5 0 390733 162659
7187.1 7175.3 0.0 0.0 7194G 7195G 445652 445695 5 0 391608 162725
7239.0 7227.9 0.0 0.0 7243G 7243G 448672 448706 6 0 395869 162754
7274.9 7265.4 0.1 0.0 7280G 7281G 451128 451166 4 0 397864 162819
7312.4 7303.0 0.0 0.0 7318G 7319G 453595 453630 6 0 400019 162845
7350.9 7343.0 0.1 0.0 7356G 7356G 455960 455989 6 0 401454 162913
7382.7 7374.9 0.0 0.0 7389G 7388G 458012 458056 4 0 416763 162941
7223.1 7408.1 0.1 0.0 7090G 7054G 435703 434914 9 0 409882 163007
5812.4 6072.5 0.1 0.0 5732G 5699G 353339 352166 9 0 336315 163030
4771.7 5020.7 0.1 0.0 4685G 4647G 288341 286658 10 0 268862 163049
3756.9 3938.8 0.1 0.0 3704G 3680G 229150 228282 9 0 214843 163105
2917.6 3037.0 0.1 0.0 2874G 2855G 184359 184342 9 0 172238 163117
2293.8 2387.3 0.1 0.0 2236G 2219G 147041 145988 9 0 142646 163129
2191.5 2185.6 0.0 0.0 2194G 2194G 146577 146586 2 0 144174 163140
2220.8 2214.4 0.0 0.0 2223G 2223G 148278 148278 3 0 145885 163151
2248.7 2242.1 0.0 0.0 2251G 2251G 149918 149931 3 0 146501 163204
2276.9 2270.4 0.0 0.0 2279G 2279G 151558 151569 6 0 148687 163215
2305.1 2298.9 0.0 0.0 2307G 2308G 153279 153288 4 0 149455 163227
2335.5 2328.2 0.1 0.0 2338G 2338G 155102 155110 4 0 151678 163239
2363.2 2356.0 0.0 0.0 2366G 2367G 156767 156781 4 0 152698 163251
2397.2 2389.9 0.1 0.0 2400G 2400G 158685 158695 4 0 153607 163304
2425.1 2420.1 0.0 0.0 2427G 2427G 160372 160379 6 0 156117 163316
2452.9 2446.2 0.0 0.0 2455G 2455G 161982 161992 6 0 157266 163328
2481.5 2475.2 0.0 0.0 2484G 2484G 163572 163585 3 0 157855 163340
2511.8 2505.4 0.0 0.0 2515G 2514G 165116 165119 6 0 158368 163352
2538.8 2532.4 0.1 0.0 2542G 2542G 166442 166459 5 0 161064 163405
2572.0 2565.3 0.0 0.0 2576G 2575G 168425 168431 7 0 162030 163417
2605.9 2598.0 0.1 0.0 2610G 2609G 170489 170504 4 0 163244 163429
2638.7 2631.6 0.0 0.0 2642G 2641G 172291 172301 7 0 166125 163442
2665.6 2659.2 0.1 0.0 2670G 2668G 173492 173499 5 0 166802 163455
2699.8 2691.1 0.0 0.0 2705G 2705G 175462 175464 5 0 167658 163509
2739.3 2730.8 0.1 0.0 2743G 2742G 177814 177830 4 0 169050 163522
2772.6 2765.4 0.0 0.0 2776G 2776G 179935 179951 5 0 172912 163536
2802.6 2795.6 0.1 0.0 2806G 2806G 181672 181687 4 0 174264 163549
2836.4 2829.1 0.0 0.0 2840G 2839G 183708 183726 5 0 175223 163602
2873.2 2863.1 0.1 0.0 2878G 2878G 185920 185934 4 0 176299 163617
2914.6 2905.9 0.1 0.0 2919G 2919G 188479 188504 5 0 177991 163631
To reproduce this I start with this patch:
diff --git a/db/db_impl/db_impl_files.cc b/db/db_impl/db_impl_files.cc
index da87aa5cd..e011dad50 100644
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@@ -526,6 +526,10 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
number >= state.min_pending_output;
if (!keep) {
files_to_del.insert(number);
+ } else {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme live %d c1=%d c2=%d", (int) number,
+ sst_live_set.find(number) != sst_live_set.end(),
+ number >= state.min_pending_output);
}
break;
case kBlobFile:
@@ -602,9 +606,15 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
// = -1, then no need to delete or schedule delete these files since they
// will be removed by their owner, e.g. the primary instance.
if (!own_files) {
+ if (type == kTableFile) {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme notown %d", (int) number);
+ }
continue;
}
if (schedule_only) {
+ if (type == kTableFile) {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme scheduleonly %d", (int) number);
+ }
InstrumentedMutexLock guard_lock(&mutex_);
SchedulePendingPurge(fname, dir_to_sync, type, number, state.job_id);
} else {
diff --git a/db/version_set.cc b/db/version_set.cc
index 9d37db84f..af68f8b06 100644
--- a/db/version_set.cc
+++ b/db/version_set.cc
@@ -6109,6 +6109,7 @@ void VersionSet::GetObsoleteFiles(std::vector<ObsoleteFileInfo>* files,
}
}
obsolete_files_.swap(pending_files);
+ ROCKS_LOG_INFO(db_options_->info_log, "Keepme getobsolete %" PRIu64 " min_pending, %d of %d moved\n", min_pending_output, (int) files->size(), (int) (files->size() + obsolete_files_.size()));
std::vector<ObsoleteBlobFileInfo> pending_blob_files;
for (auto& blob_file : obsolete_blob_files_) {
And then I run db_bench like this. The problems occur during overwrite.
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --universal_compression_size_percent=-1 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1660684521 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_fillseq.wal_disabled.v400.log.r.csv
numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --duration=7200 --threads=32 --merge_operator="put" --seed=1660689764 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_overwrite.t32.s0.log.r.csv
Then I grep LOG:
grep -E 'table_file_del|compaction_finished|Keepme' /data/m/rx/LOG
In these examples overwrite starts at 15:42:46. From the monitoring output shown earlier, the database directory reaches ~7TB between timestamps 162604 and 163007. There is a table_file_deletion event at 15:43:26:
2022/08/16-15:43:26.280134 1898167 EVENT_LOG_v1 {"time_micros": 1660689806280130, "job": 1104, "event": "table_file_deletion", "file_number": 67046}
But then there are no more until 16:29:50, which explains why the database directory reaches ~7TB:
2022/08/16-16:29:50.371093 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371087, "job": 821, "event": "table_file_deletion", "file_number": 479984}
2022/08/16-16:29:50.371173 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371171, "job": 821, "event": "table_file_deletion", "file_number": 479983}
From the logging I added to GetObsoleteFiles, the number of files in obsolete_files_ reaches 317006. None are moved into files (the set that can actually be deleted) because min_pending_output=67303 from 15:43:18 until 16:29:49.
2022/08/16-15:43:18.560721 1898169 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 55 moved
...
2022/08/16-16:29:49.554543 1898170 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.627389 1898172 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.802757 1898164 [/version_set.cc:6112] Keepme getobsolete 480013 min_pending, 317006 of 319190 moved
I assume job 821 is the long-running compaction that blocked things (it kept min_pending_output at 67303), because its compaction_time_micros ≈ 2721 seconds and total_output_size ≈ 867GB match the L39 row in the stats above:
2022/08/16-16:29:50.002195 1898164 (Original Log Time 2022/08/16-16:29:49.265941) EVENT_LOG_v1 {"time_micros": 1660692589249330, "job": 821, "event": "compaction_finished", "compaction_time_micros": 2721624360, "compaction_time_cpu_micros": 9388960476, "output_level": 39, "num_output_files": 55055, "total_output_size": 931328757382, "num_input_records": 4000032144, "num_output_records": 4000000000, "num_subcompactions": 4, "output_compression": "LZ4", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 80, 192, 443, 2616, 15301, 0, 0, 0, 55055]}
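The gating seen in the "Keepme getobsolete" log lines can be modeled as a simple partition. This is a hedged Python sketch (not the real implementation): obsolete files numbered below min_pending_output move to the deletable list, and the rest stay queued in obsolete_files_.

```python
# Sketch of the GetObsoleteFiles gating observed in the logs: only obsolete
# files numbered below min_pending_output become deletable; the rest stay
# queued in obsolete_files_ and keep accumulating on disk.
def get_obsolete_files(obsolete_file_numbers, min_pending_output):
    deletable = [n for n in obsolete_file_numbers if n < min_pending_output]
    still_pending = [n for n in obsolete_file_numbers if n >= min_pending_output]
    return deletable, still_pending

# While job 821 ran, min_pending_output stayed at 67303, so nothing moved:
queued = list(range(67304, 67304 + 319178))
deletable, pending = get_obsolete_files(queued, 67303)
assert len(deletable) == 0 and len(pending) == 319178

# Once it finished, min_pending_output jumped to 480013 and the backlog moved:
deletable, pending = get_obsolete_files(queued, 480013)
assert len(deletable) == 319178
```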
Grepping the compaction IO stats output during overwrite, the Sum line shows the LSM size never exceeds 1.23TB. Assuming the worst case of 2X transient space-amp (all files being compacted concurrently), the database directory shouldn't exceed ~2.46TB.
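As a quick sanity check of that arithmetic (a back-of-the-envelope sketch using the numbers above, not measured output):

```python
# Back-of-the-envelope check of the transient space-amp claim:
lsm_size_tb = 1.23            # max LSM size from the Sum line during overwrite
worst_case_multiplier = 2     # all files being compacted concurrently
expected_max_tb = lsm_size_tb * worst_case_multiplier
peak_observed_tb = 7.4        # peak sst-sz from the monitoring output
assert abs(expected_max_tb - 2.46) < 1e-12
# Observed usage is ~6x the LSM size, about triple the worst-case bound:
assert peak_observed_tb / lsm_size_tb > 2 * worst_case_multiplier
```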
For my own reference, the call stack that deletes an SST after a compaction ends is below. However, that depends on FindObsoleteFiles finding files that can be deleted, and FindObsoleteFiles depends on GetObsoleteFiles returning those files, which doesn't happen as explained above. MinObsoleteSstNumberToKeep is also relevant.
#0 0x00000000006a029d in rocksdb::DeleteScheduler::DeleteFile (this=this@entry=0x7ffff76d8440, file_path=..., dir_to_sync=..., force_bg=force_bg@entry=false) at file/delete_scheduler.cc:63
#1 0x00000000006aba6d in rocksdb::SstFileManagerImpl::ScheduleFileDeletion (this=0x7ffff76d8380, file_path=..., path_to_sync=..., force_bg=<optimized out>) at file/sst_file_manager_impl.cc:423
#2 0x00000000006a2209 in rocksdb::DeleteDBFile (db_options=db_options@entry=0x7ffff76e9330, fname=..., dir_to_sync=..., force_bg=force_bg@entry=false, force_fg=<optimized out>) at file/file_util.cc:123
#3 0x00000000005435ad in rocksdb::DBImpl::DeleteObsoleteFileImpl (this=0x7ffff76e9000, job_id=6, fname=..., path_to_sync=..., type=rocksdb::kTableFile, number=15) at db/db_impl/db_impl_files.cc:357
#4 0x0000000000549385 in rocksdb::DBImpl::PurgeObsoleteFiles (this=this@entry=0x7ffff76e9000, state=..., schedule_only=schedule_only@entry=false) at db/db_impl/db_impl_files.cc:608
#5 0x000000000053797e in rocksdb::DBImpl::BackgroundCallCompaction (this=0x7ffff76e9000, prepicked_compaction=0x0, bg_thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2984
#6 0x000000000053812e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2670
#7 0x00000000007c13a8 in std::function<void ()>::operator()() const (this=0x7ffff69fd010) at /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/std_function.h:688
#8 rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x7ffff7630200, thread_id=0) at util/threadpool_imp.cc:319
#9 0x00000000007c1743 in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x7ffff7667f70) at util/threadpool_imp.cc:360
#10 0x00007ffff7ea9661 in std::execute_native_thread_routine (__p=0x7ffff76b2880) at ../../../.././libstdc++-v3/src/c++11/thread.cc:80
#11 0x00007ffff7db720c in start_thread (arg=0x7ffff69fe700) at pthread_create.c:479
#12 0x00007ffff7b7616f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
On the bright side, this is explained in the wiki.
To prevent it from happening, we make use of the fact that new files are created with an incrementing file number. Before each flush or compaction job runs, we remember the number of the latest SST file created at that time. If a full garbage collection runs before the job finishes, all SST files with a number larger than that will be kept. The condition is released after the job is finished. Since multiple flush and compaction jobs can run in parallel, all SST files with a number larger than the number remembered by the earliest ongoing job will be kept. It is possible that we have some false positives, but they will be cleared up eventually.
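The mechanism the wiki describes can be sketched in a few lines (a hedged model, not the actual VersionSet code): each in-flight job remembers the newest file number when it starts, and the bound is the minimum across all in-flight jobs, so a single long-running job pins it.

```python
# Model of how min_pending_output behaves per the wiki text above: it is the
# smallest file number remembered by any still-running flush/compaction job.
def compute_min_pending_output(inflight_job_start_numbers, next_file_number):
    # With no in-flight jobs, every existing file number is below the bound.
    if not inflight_job_start_numbers:
        return next_file_number
    return min(inflight_job_start_numbers)

# Job 821 started when the newest file number was ~67303 and ran ~45 minutes,
# pinning the bound even as other jobs started and finished around it:
assert compute_min_pending_output([67303, 250000, 400000], 480013) == 67303
# Once it finishes, the bound jumps and the backlog becomes deletable:
assert compute_min_pending_output([400000], 480013) == 400000
```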
Below are compaction IO stats from the end of overwrite for tests that use buffered IO and then O_DIRECT for compaction. The results are from RocksDB version 6.20.
First, from a server in the Google cloud with 30 CPUs (hyperthreading disabled), 22 client threads, 64G of RAM, and 3TB of EBS "persistent SSD" storage, with max_background_jobs=8. I cut the database size in half (loading 2B KV pairs) to avoid the out-of-space error. The results are here.
- the value for Comp(sec) is much larger than CompMergeCPU(sec) because the compaction threads wait a lot on IO
- target_file_size_base is 16MB, bytes_per_sync is 8MB, compaction_readahead_size is 2MB. For future tests I might need to reduce bytes_per_sync to 1MB or 2MB.
From the buffered IO result:
- For L38 there was 1 compaction, Write(GB)=93.7, Comp(sec)=820.93 or ~114 MB/s
From the O_DIRECT result:
- for L38 there were 3 compactions, Write(GB)=226.4, Comp(sec)=1616.09 or ~140 MB/s
From a faster server (256G of RAM, 80 HW threads from 40 cores with hyperthreading, and fast direct-attached NVMe SSD), the results are here.
Looking at the stats for L38 with buffered IO:
- 1 compaction, Write(GB)=222.3, Comp(sec)=223.67 or 993 MB/s
And with O_DIRECT
- 1 compaction, Write(GB)=321.3, Comp(sec)=991.23 or 324 MB/s
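The per-level rates quoted above are just Write(GB) divided by Comp(sec). A small sketch to recompute them (assuming GB = 1000 MB and truncation, which matches the quoted figures):

```python
# Recompute the compaction write rates quoted above (GB = 1000 MB, truncated):
def write_rate_mb_per_s(write_gb, comp_sec):
    return int(write_gb * 1000 / comp_sec)

assert write_rate_mb_per_s(93.7, 820.93) == 114    # buffered IO, cloud server
assert write_rate_mb_per_s(226.4, 1616.09) == 140  # O_DIRECT, cloud server
assert write_rate_mb_per_s(222.3, 223.67) == 993   # buffered IO, fast server
assert write_rate_mb_per_s(321.3, 991.23) == 324   # O_DIRECT, fast server
```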