mysql-5.6

Flush memtable and compact L0 files for read only workloads

Open mdcallag opened this issue 7 years ago • 11 comments

The state of the LSM tree (data in memtable, #files in L0) is a significant source of variance for performance on read-only and read-heavy tests. This is much worse for read-only than for read-mostly workloads. Data in the memtable and files in L0 are overhead for read-only/mostly tests. When the memtable isn't full and when there aren't too many files in L0 then the LSM can stay in that state for a long time because there are no writes to trigger it to change. This is an issue for real workloads and for benchmarks.

I lost a few days debugging this problem recently. I know our users will also be confused by this and some will be disappointed by MyRocks performance. Part of the problem is that synthetic benchmarks are synthetic, but we can't get the world to stop running sysbench.

Can we solve this in two parts?

  1. Give me an option to flush memtable and then remove all files from L0. This must be dynamic. Perhaps there is an option to flush the memtable today. Temporarily setting the L0 compaction trigger to 0 lets me flush the L0, but I doubt that is supported today.
  2. Most users won't know to do things I list in step 1, and they shouldn't have to know them. Can we make MyRocks and/or RocksDB adaptive and figure out it should flush the memtable and L0 when a test is read-only and sometimes when it is extremely read-heavy?

Someone suggested that CompactRange(nullptr, nullptr) flushes memtables and compacts L0. For step 1 we need a way to trigger that via SQL.

mdcallag avatar Nov 15 '16 15:11 mdcallag

With sysbench read-only range query tests I get 5% to 20% more QPS when the memtable and level 0 are flushed. Details at https://gist.github.com/mdcallag/68052cdd36fe122354bc23ec337fb986

mdcallag avatar Jan 18 '17 16:01 mdcallag

@siying suggested that MyRocks can, for now, issue a Flush() and compact all L0 files using the CompactFiles() API, but a better solution would be to bring back read-triggered compactions from LevelDB.

IslamAbdelRahman avatar Jan 19 '17 00:01 IslamAbdelRahman

Siying pointed out that MyRocks can use DB::GetColumnFamilyMetaData() to get the list of L0/L1 files in a CF. Find the files from there and call DB::CompactFiles() with those files.
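A rough sketch of the file-selection step described above. This is not MyRocks code: the structs are stand-ins that mirror only the fields needed here (level number and file name), and `CollectInputFiles` is a hypothetical helper name. With a real database the metadata would come from DB::GetColumnFamilyMetaData().

```cpp
#include <string>
#include <vector>

// Stand-in types (assumptions for illustration). They model only the
// per-level list of SST file names that the column family metadata exposes.
struct SstFileInfo {
  std::string name;
};

struct LevelInfo {
  int level;
  std::vector<SstFileInfo> files;
};

struct ColumnFamilyInfo {
  std::vector<LevelInfo> levels;
};

// Hypothetical helper: collect the names of every SST file in levels
// 0..max_level. The result is the input list for a manual compaction.
std::vector<std::string> CollectInputFiles(const ColumnFamilyInfo& cf,
                                           int max_level) {
  std::vector<std::string> names;
  for (const LevelInfo& lvl : cf.levels) {
    if (lvl.level > max_level) {
      continue;  // leave deeper levels untouched
    }
    for (const SstFileInfo& file : lvl.files) {
      names.push_back(file.name);
    }
  }
  return names;
}
```

The resulting list would then be passed to DB::CompactFiles() with L1 as the output level, after a Flush() so that the memtable contents are already in L0.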

yoshinorim avatar Jan 19 '17 00:01 yoshinorim

We have the rocksdb_compact_cf variable to rebuild a specific CF. We can add another command variable (or an SQL function) to flush the memtable and compact L0/L1.

yoshinorim avatar Jan 19 '17 00:01 yoshinorim

Since RocksDB has a block cache, is it necessary to flush the memtable and compact L0 SST files for read-only queries? There are at most 4 SST files in L0.

zhangjinpeng87 avatar Jan 19 '17 06:01 zhangjinpeng87

@mdcallag

zhangjinpeng87 avatar Jan 19 '17 06:01 zhangjinpeng87

While working on this I think I have discovered a bug within RocksDB. My proposed patch is here. https://gist.github.com/alxyang/df534c195bf9fd8c516c57ab4fb5f610

The code changes are relatively straightforward: they just get the column family metadata and call CompactFiles().

Stacktrace: (looks like there is some issue with calling CompactFiles when there is a CompactionFilter enabled, the SuperVersion is not set correctly) https://gist.github.com/alxyang/76efe0158eb217694264277b3289ce2b

@IslamAbdelRahman @siying do you mind taking a look? You can pull the latest MyRocks, apply my patch via git apply alex.patch, and run to trigger the crash.

alxyang avatar Jan 23 '17 23:01 alxyang

Thanks @alxyang, it looks like an issue in RocksDB that happens when CompactFiles() is called with a CompactionFilter enabled and the CompactionFilter uses DB::Get() inside. We are working on fixing it.

IslamAbdelRahman avatar Jan 24 '17 00:01 IslamAbdelRahman

@zhangjinpeng1987 if the working set is so small that it is all cached in memory, RocksDB will lose to a B-tree: both read uncompressed data from memory, but RocksDB also needs to merge data from multiple levels, L0 files, and memtables. To improve read-only workloads, we can flush memtables and compact L0 files so there is less data to merge.
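To make the merge overhead concrete, here is a toy count of the sorted runs a range scan has to merge under leveled compaction. It assumes each memtable holding data and each L0 file is its own sorted run (L0 files can overlap), while each non-empty level from L1 down counts as one run. The struct and function names are hypothetical.

```cpp
// Toy model (assumption for illustration), not RocksDB code.
struct LsmShape {
  int memtables_with_data;    // memtables that currently hold data
  int l0_files;               // L0 SST files; their key ranges may overlap
  int nonempty_lower_levels;  // number of non-empty levels from L1 down
};

// Number of sorted runs an iterator must merge for a range scan.
int SortedRunsToMerge(const LsmShape& shape) {
  return shape.memtables_with_data + shape.l0_files +
         shape.nonempty_lower_levels;
}
```

With one memtable holding data, 4 L0 files, and 3 non-empty lower levels, this gives 8 runs; after flushing the memtable and compacting L0 into L1 it drops to 3.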

siying avatar Jan 24 '17 00:01 siying

One variant of this is an option to flush the memtable and merge L0 to L1 after at most N seconds when the workload is read-mostly. The worst case is a read-only workload where the memtable flush and L0->L1 merge triggers are never reached. Then all queries spend extra CPU going through the memtable and L0, when it would be better to pay the small, one-time cost of a memtable flush and L0->L1 merge.
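A minimal sketch of what such a time-based policy check could look like. Everything here is hypothetical: the struct, its fields, and the read-fraction threshold are assumptions for illustration, not existing MyRocks or RocksDB options.

```cpp
#include <cstdint>

// Hypothetical snapshot of the state a background thread could sample.
struct LsmState {
  bool memtable_has_data;         // unflushed data exists
  int l0_file_count;              // SST files currently in L0
  double oldest_unflushed_age_s;  // seconds since the oldest pending data
  std::uint64_t reads_in_window;  // reads seen in the sampling window
  std::uint64_t writes_in_window; // writes seen in the sampling window
};

// Returns true when the memtable/L0 overhead has been pending for at
// least max_age_s seconds and the recent workload is read-mostly.
bool ShouldFlushAndCompact(const LsmState& s, double max_age_s,
                           double min_read_fraction) {
  if (!s.memtable_has_data && s.l0_file_count == 0) {
    return false;  // nothing to clean up
  }
  if (s.oldest_unflushed_age_s < max_age_s) {
    return false;  // the normal write-driven triggers may still fire
  }
  const std::uint64_t total = s.reads_in_window + s.writes_in_window;
  if (total == 0) {
    return false;  // idle: no reads that would benefit
  }
  const double read_fraction =
      static_cast<double>(s.reads_in_window) / static_cast<double>(total);
  return read_fraction >= min_read_fraction;
}
```

For a read-only workload the read fraction is 1.0, so the check fires as soon as the pending data is older than N seconds, which covers the worst case described above.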

mdcallag avatar Jul 20 '18 01:07 mdcallag

Other things that can help, if done adaptively by RocksDB depending on the workload, are reducing the L0 trigger so there are fewer L0 SSTs and reducing the memtable size. I think this is feasible and hope for progress. Otherwise I will continue mis-reporting perf results for MyRocks and RocksDB, and I don't always have the time to fix my mistakes. But the real winners will be users.

mdcallag avatar Jan 06 '21 20:01 mdcallag