
Space-amp with universal is unpredictable because old SSTs are not deleted when there are long-running compactions

Open mdcallag opened this issue 1 year ago • 4 comments

When running tools/benchmark.sh for universal compaction with a database much larger than memory, I get unexpected out-of-disk-space errors. The interesting details are:

  • compaction IO stats show the database is ~1TB (per the Sum line)
  • Options.sst_file_manager.rate_bytes_per_sec: 0, this is with db_bench and deletes are not rate limited
  • Options.compaction_options_universal.max_size_amplification_percent: 100

The problem is that SST files are not deleted because of the min_pending_output check in GetObsoleteFiles. But I will save that for the next comment on this issue.
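For context, the keep condition in PurgeObsoleteFiles can be sketched like this (a minimal Python sketch of the logic, not RocksDB code; names are simplified):

```python
def keep_file(number, sst_live_set, min_pending_output):
    """Sketch of the check in DBImpl::PurgeObsoleteFiles: a candidate SST is
    kept if it is still referenced by a live Version, or if its file number is
    >= min_pending_output, meaning it could be an output of an in-flight job."""
    return number in sst_live_set or number >= min_pending_output

# With a long-running compaction holding min_pending_output at 67303,
# every obsolete SST numbered >= 67303 is kept, no matter how stale it is.
assert keep_file(67302, sst_live_set=set(), min_pending_output=67303) is False
assert keep_file(500000, sst_live_set=set(), min_pending_output=67303) is True
```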

Here are compaction IO stats where the Sum line shows the current database is ~1TB, yet the database directory holds up to ~7TB of SST files -- the size grows and shrinks.

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      9/6   430.21 MB   0.0    680.1     0.0    680.1    1006.9    326.9       0.0   3.1    156.7    232.1   4443.50           3165.16     32785    0.136   1697M    53K       0.0       0.0
  L8      0/0    0.00 KB   0.0      0.1     0.1      0.0       0.1      0.1       0.0   1.0    642.3    641.8      0.24              0.46         1    0.238    373K     19       0.0       0.0
  L9      0/0    0.00 KB   0.0      0.1     0.1      0.0       0.1      0.1       0.0   1.0    842.3    842.0      0.17              0.50         2    0.086    354K      2       0.0       0.0
 L10      0/0    0.00 KB   0.0      0.2     0.2      0.0       0.2      0.2       0.0   1.0    902.7    902.4      0.19              0.47         3    0.063    418K      3       0.0       0.0
 L11      0/0    0.00 KB   0.0      0.3     0.3      0.0       0.3      0.3       0.0   1.0    816.2    815.7      0.41              1.22         5    0.082    812K      8       0.0       0.0
 L12      0/0    0.00 KB   0.0      0.8     0.5      0.3       0.8      0.5       0.0   1.5    574.3    574.1      1.46              2.83        14    0.104   2038K     14       0.0       0.0
 L13      0/0    0.00 KB   0.0      2.0     1.4      0.6       2.0      1.4       0.0   1.4    875.6    875.2      2.30              6.71        22    0.105   4912K    106       0.0       0.0
 L14      0/0    0.00 KB   0.0      3.6     2.1      1.5       3.6      2.1       0.0   1.7    892.7    892.4      4.09             11.54        30    0.136   8904K    229       0.0       0.0
 L15      0/0    0.00 KB   0.0      5.0     2.7      2.3       5.0      2.7       0.0   1.9    877.2    876.9      5.84             17.00        39    0.150     12M    372       0.0       0.0
 L16      0/0    0.00 KB   0.0      4.3     3.0      1.3       4.3      3.0       0.0   1.4    803.9    803.6      5.49             14.79        44    0.125     10M    308       0.0       0.0
 L17      0/0    0.00 KB   0.0      6.6     4.3      2.3       6.6      4.3       0.0   1.5    829.8    829.5      8.19             21.97        66    0.124     16M    357       0.0       0.0
 L18      0/0    0.00 KB   0.0     10.5     6.1      4.4      10.5      6.1       0.0   1.7    799.4    799.2     13.43             35.48        96    0.140     26M    536       0.0       0.0
 L19      0/0    0.00 KB   0.0     11.5     7.9      3.6      11.5      7.8       0.0   1.5    808.6    808.4     14.51             37.61       112    0.130     28M    659       0.0       0.0
 L20      0/0    0.00 KB   0.0     27.6    13.0     14.6      27.6     13.0       0.0   2.1    881.4    881.2     32.02             88.85       144    0.222     68M   3504       0.0       0.0
 L21      0/0    0.00 KB   0.0     35.7    18.3     17.4      35.7     18.3       0.0   1.9    853.0    852.7     42.86            117.55       193    0.222     89M   5649       0.0       0.0
 L22      0/0    0.00 KB   0.0     49.6    26.7     22.9      49.6     26.6       0.0   1.9    870.2    869.9     58.34            163.70       258    0.226    123M    10K       0.0       0.0
 L23      0/0    0.00 KB   0.0     72.6    34.2     38.4      72.6     34.2       0.0   2.1    871.1    870.8     85.38            242.59       320    0.267    181M    12K       0.0       0.0
 L24      0/0    0.00 KB   0.0     78.2    44.2     34.0      78.2     44.2       0.0   1.8    851.4    851.2     94.06            268.18       321    0.293    195M    16K       0.0       0.0
 L25      0/0    0.00 KB   0.0    139.2    59.2     80.0     139.1     59.1       0.0   2.4    886.7    886.5    160.72            460.42       375    0.429    347M    29K       0.0       0.0
 L26     42/42  642.12 MB   0.0    176.3    78.3     98.0     176.3     78.2       0.0   2.3    872.6    872.2    206.95            580.45       362    0.572    440M   107K       0.0       0.0
 L27     69/69   1.07 GB   0.0    251.9    96.9    155.0     251.7     96.7       0.0   2.6    873.2    872.4    295.44            825.54       379    0.780    629M   430K       0.0       0.0
 L28    187/187  2.91 GB   0.0    203.2   102.1    101.1     203.2    102.0       0.0   2.0    813.0    812.6    256.00            677.66       399    0.642    507M   128K       0.0       0.0
 L29      0/0    0.00 KB   0.0    304.8   116.7    188.1     304.5    116.4       0.0   2.6    808.3    807.7    386.07           1001.15       343    1.126    761M   447K       0.0       0.0
 L30      0/0    0.00 KB   0.0    306.8   121.7    185.1     306.6    121.5       0.0   2.5    887.8    887.4    353.81            993.57       280    1.264    766M   238K       0.0       0.0
 L31      0/0    0.00 KB   0.0    455.3   131.8    323.5     454.8    131.3       0.0   3.5    850.7    849.8    548.04           1477.33       219    2.502   1138M   965K       0.0       0.0
 L32      0/0    0.00 KB   0.0    313.6   114.8    198.8     313.2    114.5       0.0   2.7    787.2    786.4    407.87           1046.96       108    3.777    783M   667K       0.0       0.0
 L33   1200/1200 18.83 GB   0.0    270.3    99.0    171.2     269.9     98.7       0.0   2.7    829.0    827.8    333.84            872.59        48    6.955    675M   791K       0.0       0.0
 L34   2616/2616 41.07 GB   0.0    373.2   150.1    223.1     372.5    149.4       0.0   2.5    994.2    992.2    384.38           1177.81        27   14.236    933M  1689K       0.0       0.0
 L35  15301/15301 240.39 GB   0.0   1716.1   259.1   1457.0    1697.4    240.4       0.0   6.6    852.1    842.8   2062.27           6321.49        22   93.740   4539M    46M       0.0       0.0
 L36      0/0    0.00 KB   0.0     15.2     0.0     15.2      27.6     12.4       0.0 2103.8    631.0   1147.4     24.63             86.06         1   24.634     69M    585       0.0       0.0
 L39  55055/0   867.37 GB   0.0    890.9    27.6    863.3     867.4      4.1     238.7  31.4    335.2    326.3   2721.62           9388.96         1 2721.624   4000M    32K       0.0       0.0
 Sum  74479/19421  1.15 TB   0.0   6405.4  1522.5   4882.9    6699.7   1816.8     238.7  20.5    506.3    529.6  12954.12          29106.61     37019    0.350     18G    52M       0.0       0.0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0   6405.4  1522.5   4882.9    6372.5   1489.6       0.0   0.0    604.8    601.7  10845.32          27836.22     15617    0.694     18G    52M       0.0       0.0
High      0/0    0.00 KB   0.0      0.0     0.0      0.0     327.2    327.2       0.0   0.0      0.0    158.9   2108.81           1270.39     21402    0.099       0      0       0.0       0.0

Uptime(secs): 2973.3 total, 20.0 interval
Flush(GB): cumulative 327.177, interval 3.014
Cumulative compaction: 6699.69 GB write, 2307.33 MB/s write, 6405.42 GB read, 2205.98 MB/s read, 12954.1 seconds
Interval compaction: 42.75 GB write, 2184.91 MB/s write, 39.79 GB read, 2033.79 MB/s read, 77.0 seconds
Stalls(count): 708 level0_slowdown, 708 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 268 memtable_slowdown, interval 16 total count
Block cache LRUCache@0x7f3618e28290#1898157 capacity: 210.00 GB collections: 2 last_copies: 3 last_secs: 7.19322 secs_since: 2364
Block cache entry stats(count,size,portion): DataBlock(12527444,96.49 GB,45.9464%) FilterBlock(70176,4.17 GB,1.98532%) IndexBlock(134124,7.23 GB,3.44461%) Misc(16,121.17 KB,5.50279e-05%)

** File Read Latency Histogram By Level [default] **

** DB Stats **
Uptime(secs): 2973.3 total, 20.0 interval
Cumulative writes: 816M writes, 816M keys, 59M commit groups, 13.8 writes per commit group, ingest: 323.00 GB, 111.24 MB/s
Cumulative WAL: 816M writes, 0 syncs, 816296033.00 writes per sync, written: 323.00 GB, 111.24 MB/s
Cumulative stall: 00:01:59.793 H:M:S, 4.0 percent
num-running-compactions: 6
num-running-flushes: 2

Partial output from a script that monitors what is in the database directory. The script runs the following every 3 seconds:

    b_gb=$( ls -l $DB_DIR 2> /dev/null | grep blob | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
    s_gb=$( ls -l $DB_DIR 2> /dev/null | grep sst | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
    l_gb=$( ls -l $WAL_DIR 2> /dev/null | grep log | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
    a_gb=$( ls -l $DB_DIR 2> /dev/null | awk '{ c += 1; b += $5 } END { printf "%.1f", b / (1024*1024*1024) }' )
    du1_gb=$( du -hs -BG $DB_DIR 2> /dev/null | awk '{ print $1 }' )
    du2_gb=$( du -hs -BG --apparent-size $DB_DIR 2> /dev/null | awk '{ print $1 }' )
    dbbpid=$( pidof db_bench )
    nf1=$( ls $DB_DIR | wc -l )
    nf1S=$( ls $DB_DIR | grep sst | wc -l )
    nf1L=$( ls $DB_DIR | grep log | wc -l )
    nf1T=$( ls $DB_DIR | grep trash | wc -l )
    nf2=$( lsof -p $dbbpid | grep REG | wc -l )
    ts=$( date +%H%M%S )
    echo -e "${a_gb}\t${s_gb}\t${l_gb}\t${b_gb}\t${du1_gb}\t${du2_gb}\t${nf1}\t${nf1S}\t${nf1L}\t${nf1T}\t${nf2}\t${ts}"

The output below shows the database directory reaches ~7TB from SST files.

Legend:
* tot-sz - total size of database directory (GB)
* sst-sz - size of SST files (GB), per ls
* log-sz - size of *.log files (GB), per ls
* blob-sz - size of blob files (GB), per ls
* du1-sz - size of database directory (GB) per du
* du2-sz - size of database directory (GB) per du --apparent-size
* num-f - number of files in database directory
* num-s - number of SSTs in database directory
* num-l - number of *.log files in database directory
* num-t - number of trash files in database directory
* lsof - number of open REG files for db_bench, per lsof
* time - timestamp as HHMMSS

tot-sz  sst-sz  log-sz  blob-sz du1-sz  du2-sz  num-f   num-s   num-l   num-t   lsof    time
6452.5  6441.7  0.1     0.0     6459G   6460G   399818  399870  7       0       353694  162127
6500.7  6490.3  0.0     0.0     6507G   6507G   402713  402750  7       0       354660  162151
6550.6  6538.3  0.1     0.0     6557G   6557G   405819  405866  5       0       355715  162215
6599.0  6587.0  0.0     0.0     6606G   6606G   408818  408833  6       0       356820  162241
6654.3  6639.3  0.0     0.0     6661G   6662G   412352  412408  5       0       357888  162306
6710.1  6697.6  0.0     0.0     6717G   6718G   415784  415804  5       0       360077  162329
6763.6  6749.5  0.1     0.0     6771G   6772G   419193  419246  4       0       360980  162355
6824.3  6809.3  0.1     0.0     6830G   6831G   422882  422936  4       0       363957  162420
6875.0  6862.0  0.1     0.0     6883G   6883G   426225  426244  7       0       364862  162445
6937.2  6921.1  0.0     0.0     6945G   6945G   429925  429988  4       0       367690  162512
6995.7  6982.4  0.0     0.0     7004G   7005G   433664  433701  4       0       369381  162538
7050.3  7039.0  0.0     0.0     7057G   7057G   436879  436923  5       0       385782  162604
7093.1  7081.1  0.0     0.0     7100G   7100G   439690  439744  4       0       387251  162632
7144.7  7132.8  0.0     0.0     7150G   7149G   442781  442797  5       0       390733  162659
7187.1  7175.3  0.0     0.0     7194G   7195G   445652  445695  5       0       391608  162725
7239.0  7227.9  0.0     0.0     7243G   7243G   448672  448706  6       0       395869  162754
7274.9  7265.4  0.1     0.0     7280G   7281G   451128  451166  4       0       397864  162819
7312.4  7303.0  0.0     0.0     7318G   7319G   453595  453630  6       0       400019  162845
7350.9  7343.0  0.1     0.0     7356G   7356G   455960  455989  6       0       401454  162913
7382.7  7374.9  0.0     0.0     7389G   7388G   458012  458056  4       0       416763  162941
7223.1  7408.1  0.1     0.0     7090G   7054G   435703  434914  9       0       409882  163007
5812.4  6072.5  0.1     0.0     5732G   5699G   353339  352166  9       0       336315  163030
4771.7  5020.7  0.1     0.0     4685G   4647G   288341  286658  10      0       268862  163049
3756.9  3938.8  0.1     0.0     3704G   3680G   229150  228282  9       0       214843  163105
2917.6  3037.0  0.1     0.0     2874G   2855G   184359  184342  9       0       172238  163117
2293.8  2387.3  0.1     0.0     2236G   2219G   147041  145988  9       0       142646  163129
2191.5  2185.6  0.0     0.0     2194G   2194G   146577  146586  2       0       144174  163140
2220.8  2214.4  0.0     0.0     2223G   2223G   148278  148278  3       0       145885  163151
2248.7  2242.1  0.0     0.0     2251G   2251G   149918  149931  3       0       146501  163204
2276.9  2270.4  0.0     0.0     2279G   2279G   151558  151569  6       0       148687  163215
2305.1  2298.9  0.0     0.0     2307G   2308G   153279  153288  4       0       149455  163227
2335.5  2328.2  0.1     0.0     2338G   2338G   155102  155110  4       0       151678  163239
2363.2  2356.0  0.0     0.0     2366G   2367G   156767  156781  4       0       152698  163251
2397.2  2389.9  0.1     0.0     2400G   2400G   158685  158695  4       0       153607  163304
2425.1  2420.1  0.0     0.0     2427G   2427G   160372  160379  6       0       156117  163316
2452.9  2446.2  0.0     0.0     2455G   2455G   161982  161992  6       0       157266  163328
2481.5  2475.2  0.0     0.0     2484G   2484G   163572  163585  3       0       157855  163340
2511.8  2505.4  0.0     0.0     2515G   2514G   165116  165119  6       0       158368  163352
2538.8  2532.4  0.1     0.0     2542G   2542G   166442  166459  5       0       161064  163405
2572.0  2565.3  0.0     0.0     2576G   2575G   168425  168431  7       0       162030  163417
2605.9  2598.0  0.1     0.0     2610G   2609G   170489  170504  4       0       163244  163429
2638.7  2631.6  0.0     0.0     2642G   2641G   172291  172301  7       0       166125  163442
2665.6  2659.2  0.1     0.0     2670G   2668G   173492  173499  5       0       166802  163455
2699.8  2691.1  0.0     0.0     2705G   2705G   175462  175464  5       0       167658  163509
2739.3  2730.8  0.1     0.0     2743G   2742G   177814  177830  4       0       169050  163522
2772.6  2765.4  0.0     0.0     2776G   2776G   179935  179951  5       0       172912  163536
2802.6  2795.6  0.1     0.0     2806G   2806G   181672  181687  4       0       174264  163549
2836.4  2829.1  0.0     0.0     2840G   2839G   183708  183726  5       0       175223  163602
2873.2  2863.1  0.1     0.0     2878G   2878G   185920  185934  4       0       176299  163617
2914.6  2905.9  0.1     0.0     2919G   2919G   188479  188504  5       0       177991  163631

mdcallag avatar Aug 16 '22 23:08 mdcallag

To reproduce this I start with this patch:

diff --git a/db/db_impl/db_impl_files.cc b/db/db_impl/db_impl_files.cc
index da87aa5cd..e011dad50 100644
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@@ -526,6 +526,10 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
                number >= state.min_pending_output;
         if (!keep) {
           files_to_del.insert(number);
+        } else {
+          ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme live %d c1=%d c2=%d", (int) number,
+                         sst_live_set.find(number) != sst_live_set.end(),
+                         number >= state.min_pending_output);
         }
         break;
       case kBlobFile:
@@ -602,9 +606,15 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
     // = -1, then no need to delete or schedule delete these files since they
     // will be removed by their owner, e.g. the primary instance.
     if (!own_files) {
+      if (type == kTableFile) {
+        ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme notown %d", (int) number);
+      }
       continue;
     }
     if (schedule_only) {
+      if (type == kTableFile) {
+        ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme scheduleonly %d", (int) number);
+      }
       InstrumentedMutexLock guard_lock(&mutex_);
       SchedulePendingPurge(fname, dir_to_sync, type, number, state.job_id);
     } else {
diff --git a/db/version_set.cc b/db/version_set.cc
index 9d37db84f..af68f8b06 100644
--- a/db/version_set.cc
+++ b/db/version_set.cc
@@ -6109,6 +6109,7 @@ void VersionSet::GetObsoleteFiles(std::vector<ObsoleteFileInfo>* files,
     }
   }
   obsolete_files_.swap(pending_files);
+  ROCKS_LOG_INFO(db_options_->info_log, "Keepme getobsolete %" PRIu64 " min_pending, %d of %d moved\n", min_pending_output, (int) files->size(), (int) (files->size() + obsolete_files_.size()));

   std::vector<ObsoleteBlobFileInfo> pending_blob_files;
   for (auto& blob_file : obsolete_blob_files_) {

And then I run db_bench like this. The problems occur during overwrite.

numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --universal_compression_size_percent=-1 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1660684521 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_fillseq.wal_disabled.v400.log.r.csv

numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --duration=7200 --threads=32 --merge_operator="put" --seed=1660689764 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_overwrite.t32.s0.log.r.csv

Then I grep LOG:

grep -E 'table_file_del|compaction_finished|Keepme' /data/m/rx/LOG

In these examples overwrite starts at 15:42:46. Per the previous comment, the database directory reaches ~7TB between timestamps 162604 and 163007 (16:26:04 to 16:30:07). There is a table_file_deletion event at 15:43:26:

2022/08/16-15:43:26.280134 1898167 EVENT_LOG_v1 {"time_micros": 1660689806280130, "job": 1104, "event": "table_file_deletion", "file_number": 67046}

But then there are no more deletions until 16:29:50, which explains why the database directory reaches ~7TB:

2022/08/16-16:29:50.371093 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371087, "job": 821, "event": "table_file_deletion", "file_number": 479984}
2022/08/16-16:29:50.371173 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371171, "job": 821, "event": "table_file_deletion", "file_number": 479983}

From the logging I added to GetObsoleteFiles, the number of files in obsolete_files_ reaches 317006. None are moved to the files output (which would make them eligible for deletion) because min_pending_output=67303 from 15:43:18 until 16:29:49:

2022/08/16-15:43:18.560721 1898169 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 55 moved
...
2022/08/16-16:29:49.554543 1898170 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.627389 1898172 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.802757 1898164 [/version_set.cc:6112] Keepme getobsolete 480013 min_pending, 317006 of 319190 moved

I assume the compaction below is the long-running one that blocked things (i.e. kept min_pending_output at 67303), because compaction_time_micros ~= 2721 seconds and total_output_size ~= 867GB:

2022/08/16-16:29:50.002195 1898164 (Original Log Time 2022/08/16-16:29:49.265941) EVENT_LOG_v1 {"time_micros": 1660692589249330, "job": 821, "event": "compaction_finished", "compaction_time_micros": 2721624360, "compaction_time_cpu_micros": 9388960476, "output_level": 39, "num_output_files": 55055, "total_output_size": 931328757382, "num_input_records": 4000032144, "num_output_records": 4000000000, "num_subcompactions": 4, "output_compression": "LZ4", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 80, 192, 443, 2616, 15301, 0, 0, 0, 55055]}
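The ~2721 seconds and ~867GB above fall out of the event's raw fields by simple unit conversion:

```python
# Fields from the compaction_finished event above
compaction_time_micros = 2721624360
total_output_size = 931328757382  # bytes

secs = compaction_time_micros / 1e6   # microseconds -> seconds
gib = total_output_size / 2**30       # bytes -> GiB

print(round(secs))      # ~2722 seconds
print(round(gib, 1))    # ~867.4 GiB
```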

Grepping the compaction IO stats output during overwrite, the Sum line shows the LSM size never exceeds 1.23TB. Assuming the worst case of 2X transient space-amp (all files being compacted concurrently), the database directory shouldn't exceed ~2.46TB.
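As a sanity check on that bound (simple arithmetic; the 2X factor is the worst case where every live file is being rewritten at once):

```python
lsm_tb = 1.23          # max LSM size during overwrite, per the Sum line
transient_amp = 2.0    # worst case: all files compacted concurrently
expected_max_tb = lsm_tb * transient_amp
observed_tb = 7.0      # what the directory actually reached

print(expected_max_tb)          # 2.46 TB expected ceiling
print(observed_tb / lsm_tb)     # ~5.7x actual space-amp, far above the bound
```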

mdcallag avatar Aug 17 '22 00:08 mdcallag

For my own reference, the call stack that deletes an SST after a compaction ends is below. That path depends on FindObsoleteFiles finding files that can be deleted, which in turn depends on GetObsoleteFiles returning those files -- and that doesn't happen here, as explained above. MinObsoleteSstNumberToKeep is also relevant.

#0  0x00000000006a029d in rocksdb::DeleteScheduler::DeleteFile (this=this@entry=0x7ffff76d8440, file_path=..., dir_to_sync=..., force_bg=force_bg@entry=false) at file/delete_scheduler.cc:63
#1  0x00000000006aba6d in rocksdb::SstFileManagerImpl::ScheduleFileDeletion (this=0x7ffff76d8380, file_path=..., path_to_sync=..., force_bg=<optimized out>) at file/sst_file_manager_impl.cc:423
#2  0x00000000006a2209 in rocksdb::DeleteDBFile (db_options=db_options@entry=0x7ffff76e9330, fname=..., dir_to_sync=..., force_bg=force_bg@entry=false, force_fg=<optimized out>) at file/file_util.cc:123
#3  0x00000000005435ad in rocksdb::DBImpl::DeleteObsoleteFileImpl (this=0x7ffff76e9000, job_id=6, fname=..., path_to_sync=..., type=rocksdb::kTableFile, number=15) at db/db_impl/db_impl_files.cc:357
#4  0x0000000000549385 in rocksdb::DBImpl::PurgeObsoleteFiles (this=this@entry=0x7ffff76e9000, state=..., schedule_only=schedule_only@entry=false) at db/db_impl/db_impl_files.cc:608
#5  0x000000000053797e in rocksdb::DBImpl::BackgroundCallCompaction (this=0x7ffff76e9000, prepicked_compaction=0x0, bg_thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2984
#6  0x000000000053812e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2670
#7  0x00000000007c13a8 in std::function<void ()>::operator()() const (this=0x7ffff69fd010) at /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/std_function.h:688
#8  rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x7ffff7630200, thread_id=0) at util/threadpool_imp.cc:319
#9  0x00000000007c1743 in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x7ffff7667f70) at util/threadpool_imp.cc:360
#10 0x00007ffff7ea9661 in std::execute_native_thread_routine (__p=0x7ffff76b2880) at ../../../.././libstdc++-v3/src/c++11/thread.cc:80
#11 0x00007ffff7db720c in start_thread (arg=0x7ffff69fe700) at pthread_create.c:479
#12 0x00007ffff7b7616f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

mdcallag avatar Aug 17 '22 00:08 mdcallag

On the bright side, this behavior is explained in the wiki:

To prevent it from happening, we make use of a good feature: new files are created using file names with an incremental number. Before each flush or compaction job runs, we remember the number of the latest SST file created at that time. If a full garbage collection runs before the job finishes, all SST files with a number larger than that will be kept. The condition is released after the job is finished. Since multiple flush and compaction jobs can run in parallel, all SST files with a number larger than the number the earliest ongoing job remembers will be kept. It is possible that we have some false positives, but they will be cleared up eventually.
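The wiki's rule can be sketched as follows: with several jobs in flight, the effective cutoff is the smallest remembered file number among ongoing jobs, so one long-running job pins everything newer (a hypothetical Python sketch, not RocksDB code):

```python
def pending_output_cutoff(ongoing_jobs):
    """Each ongoing flush/compaction job remembers the newest SST file number
    at the time it started. Obsolete files numbered >= the minimum of those
    remembered numbers must be kept until that job finishes."""
    return min(ongoing_jobs, default=float("inf"))

# A long compaction that started when the newest file was 67303 pins the
# cutoff there, even as short jobs start and finish at much higher numbers.
jobs = {"long_compaction": 67303, "short_compaction": 479000}
assert pending_output_cutoff(jobs.values()) == 67303

# Once the long job finishes, the cutoff jumps and the backlog can be freed.
del jobs["long_compaction"]
assert pending_output_cutoff(jobs.values()) == 479000
```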

mdcallag avatar Aug 17 '22 13:08 mdcallag

This comment has compaction IO stats from the end of overwrite for tests that use buffered IO and then O_DIRECT for compaction. The results are from RocksDB version 6.20.

First, from a server in Google Cloud with 30 CPUs (hyperthreading disabled), 22 client threads, 64G of RAM, and 3TB of EBS "persistent SSD" storage, with max_background_jobs=8. I reduced the size of the database by half (loading 2B KV pairs) to avoid the out-of-space error. The results are here.

  • the value for Comp(sec) is much larger than CompMergeCPU(sec) because the compaction threads wait a lot on IO
  • target_file_size_base is 16MB, bytes_per_sync is 8MB, compaction_readahead_size is 2MB. For future tests I might need to reduce bytes_per_sync to 1MB or 2MB.

From the buffered IO result:

  • For L38 there was 1 compaction, Write(GB)=93.7, Comp(sec)=820.93 or ~114 MB/s

From the O_DIRECT result:

  • For L38 there were 3 compactions, Write(GB)=226.4, Comp(sec)=1616.09 or ~140 MB/s

From a faster server - 256G of RAM, 80 HW threads (40 CPUs with hyperthreading enabled), and fast direct-attached NVMe SSD - the results are here.

Looking at the stats for L38 with buffered IO:

  • 1 compaction, Write(GB)=222.3, Comp(sec)=223.67 or 993 MB/s

And with O_DIRECT

  • 1 compaction, Write(GB)=321.3, Comp(sec)=991.23 or 324 MB/s
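The per-level write rates quoted above are just Write(GB) divided by Comp(sec), using decimal MB:

```python
def write_mb_per_sec(write_gb, comp_sec):
    # Write(GB) * 1000 MB/GB / Comp(sec) -- decimal units, matching how the
    # rates in the bullets above were derived
    return write_gb * 1000.0 / comp_sec

# L38 with buffered IO on the cloud server: 1 compaction
print(round(write_mb_per_sec(93.7, 820.93)))    # ~114 MB/s
# L38 with O_DIRECT on the cloud server: 3 compactions
print(round(write_mb_per_sec(226.4, 1616.09)))  # ~140 MB/s
```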

mdcallag avatar Aug 22 '22 23:08 mdcallag