
For WBM flushes to pick largest memtable to flush

Open klsince opened this issue 1 year ago • 4 comments

Expected behavior

When a flush is triggered by the Write Buffer Manager (WBM), try to pick the largest memtable that has not been flushed yet, instead of the one with the smallest SeqNo.

Actual behavior

We found that RocksDB sometimes started flushing very frequently with the Write Buffer Manager as the flush reason, but each flush wrote only a few entries/bytes. This generated a lot of small SST files, and the P99 latency of both DB put and get degraded quickly.

We found this code snippet: https://github.com/facebook/rocksdb/blob/cd577f605948894b51fbaab39d1df03a04dfd70f/db/db_impl/db_impl_write.cc#L1745. It seems that RocksDB always picks the memtable with the smallest SeqNo for a WBM flush. But sometimes those memtables can be very small, as seen in our environment.

Not sure if there is already a way to configure RocksDB to pick the largest memtables for WBM flushes. Please suggest one if so. Otherwise, it'd be great to support such flexibility.
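To make the difference concrete, here is a minimal sketch of the two selection policies. It is not RocksDB code: `CfState`, `PickOldest`, and `PickLargest` are hypothetical names, and the struct is a simplified stand-in for a column family's active memtable. `PickOldest` mirrors the behavior described above (smallest SeqNo among non-empty memtables), and `PickLargest` is the behavior this issue asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of one column family's active memtable.
struct CfState {
  std::string name;
  uint64_t creation_seqno;     // SeqNo at which the memtable was created
  std::size_t memtable_bytes;  // unflushed bytes held by the memtable
};

// Behavior described in the issue: among non-empty memtables,
// pick the one with the smallest SeqNo, regardless of its size.
const CfState* PickOldest(const std::vector<CfState>& cfs) {
  const CfState* picked = nullptr;
  for (const auto& cf : cfs) {
    if (cf.memtable_bytes == 0) continue;  // nothing to flush
    if (picked == nullptr || cf.creation_seqno < picked->creation_seqno) {
      picked = &cf;
    }
  }
  return picked;
}

// Requested behavior: pick the memtable that frees the most memory.
const CfState* PickLargest(const std::vector<CfState>& cfs) {
  const CfState* picked = nullptr;
  for (const auto& cf : cfs) {
    if (cf.memtable_bytes == 0) continue;  // nothing to flush
    if (picked == nullptr || cf.memtable_bytes > picked->memtable_bytes) {
      picked = &cf;
    }
  }
  return picked;
}
```

With one old 1MB memtable and one newer 100MB memtable, `PickOldest` returns the 1MB one, so the flush frees almost nothing, while `PickLargest` returns the 100MB one.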

Steps to reproduce the behavior

CF write buffer size: 100MB; DB write buffer size: 5GB

  1. Keep writing KV pairs to CFs with larger SeqNo for a while, until the total data size gets very close to the DB buffer limit.
  2. Then write KV pairs to CFs with smaller SeqNo. WBM flushes start to happen very frequently, but each flush only writes a little data.
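The effect of these steps can be illustrated with a toy model that counts how many flushes are needed to bring total memtable memory under a target. This is only a sketch under simplifying assumptions: `Memtable`, `Policy`, and `FlushesUntilBelow` are hypothetical names, sizes are in MB, and a flushed memtable is simply removed rather than replaced by a new empty one as a real DB would do.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an unflushed memtable; size is in MB.
struct Memtable {
  uint64_t creation_seqno;
  std::size_t mb;
};

enum class Policy { kOldestFirst, kLargestFirst };

// Counts flushes until total memtable memory is at or below target_mb.
// Each flush removes exactly one memtable, chosen by the given policy.
int FlushesUntilBelow(std::vector<Memtable> mems, std::size_t target_mb,
                      Policy policy) {
  std::size_t total = 0;
  for (const auto& m : mems) total += m.mb;

  int flushes = 0;
  while (total > target_mb && !mems.empty()) {
    auto it = std::min_element(
        mems.begin(), mems.end(),
        [policy](const Memtable& a, const Memtable& b) {
          return policy == Policy::kOldestFirst
                     ? a.creation_seqno < b.creation_seqno  // smallest SeqNo
                     : a.mb > b.mb;                         // largest size
        });
    total -= it->mb;
    mems.erase(it);
    ++flushes;
  }
  return flushes;
}
```

For example, with three old 1MB memtables (SeqNo 1, 2, 3), two newer 100MB memtables (SeqNo 10, 11), and a 150MB target, the oldest-first policy needs 4 flushes and produces three tiny SST files along the way, while the largest-first policy needs 1.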

klsince, Dec 16 '23 00:12