Make WBM flushes pick the largest memtable to flush
Expected behavior
When a flush is triggered by the Write Buffer Manager (WBM), RocksDB should try to pick the largest memtable that has not been flushed yet, instead of the one with the smallest SeqNo.
Actual behavior
We found that RocksDB sometimes starts flushing very frequently with Write Buffer Manager as the flush reason, but each flush writes only a few entries/bytes. This generates a lot of small SST files, and the P99 latency of both DB put and get degrades quickly.
We found this code snippet: https://github.com/facebook/rocksdb/blob/cd577f605948894b51fbaab39d1df03a04dfd70f/db/db_impl/db_impl_write.cc#L1745. It seems that RocksDB always picks the memtables with the smallest SeqNo for a WBM flush. But those memtables can sometimes be very small, as seen in our environment.
We are not sure whether there is already a way to configure RocksDB to pick the largest memtables for a WBM flush. Please suggest one if so. Otherwise, it would be great to support such flexibility.
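To make the difference between the two policies concrete, here is a minimal self-contained sketch. It is not RocksDB's code: `CfState`, `PickSmallestSeqNo`, and `PickLargestMemtable` are hypothetical names that only model the behavior described above (smallest SeqNo wins today, largest unflushed memtable is what we are asking for).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one column family's unflushed memtable state.
struct CfState {
  std::string name;
  uint64_t oldest_seqno;   // SeqNo of the oldest unflushed memtable
  size_t memtable_bytes;   // bytes held in unflushed memtables
};

// Behavior as we understand it today: choose the CF with the smallest SeqNo.
// Returns nullptr when there are no candidates.
const CfState* PickSmallestSeqNo(const std::vector<CfState>& cfs) {
  auto it = std::min_element(cfs.begin(), cfs.end(),
                             [](const CfState& a, const CfState& b) {
                               return a.oldest_seqno < b.oldest_seqno;
                             });
  return it == cfs.end() ? nullptr : &*it;
}

// Requested behavior: choose the CF holding the most unflushed memtable bytes.
// Returns nullptr when there are no candidates.
const CfState* PickLargestMemtable(const std::vector<CfState>& cfs) {
  auto it = std::max_element(cfs.begin(), cfs.end(),
                             [](const CfState& a, const CfState& b) {
                               return a.memtable_bytes < b.memtable_bytes;
                             });
  return it == cfs.end() ? nullptr : &*it;
}
```

With one CF at SeqNo 10 holding 1KB and another at SeqNo 500 holding 90MB, the first function selects the 1KB memtable, so the flush frees almost no memory, while the second selects the 90MB one.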
Steps to reproduce the behavior
- set the CF write buffer size to 100MB and the DB write buffer size to 5GB
- keep writing KV pairs to CFs with larger SeqNo for a while, until the total data size gets very close to the DB write buffer limit
- then write KV pairs to CFs with smaller SeqNo; WBM flushes start to happen very frequently, but each flush writes only a little data
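The buffer sizes in the first step could be set roughly as follows. This is only a configuration sketch, assuming the sizes are set through `write_buffer_size` and `db_write_buffer_size` on `rocksdb::Options`; the helper name `MakeReproOptions` is hypothetical, and our actual setup may differ in other options.

```cpp
#include <rocksdb/options.h>

// Hypothetical helper: options matching the buffer sizes in the steps above.
rocksdb::Options MakeReproOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;
  // Per-column-family memtable size: 100MB.
  options.write_buffer_size = 100ull << 20;
  // Total memtable budget across all column families of the DB: 5GB.
  options.db_write_buffer_size = 5ull << 30;
  return options;
}
```

The same `Options` object can be used to derive the `ColumnFamilyOptions` for each CF, so that every CF gets the 100MB write buffer while the DB-wide limit stays at 5GB.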