rocksdb
rocksdb copied to clipboard
Creating and dropping column families has gotten much slower
Expected behavior
Creating and dropping column families should be fast.
Actual behavior
It used to be fast in version 4.1-4.9, but has since gotten a lot slower. I've tested this across several versions and it looks like it has gotten increasingly worse.
AFAICT the behaviour is not entirely linear either. Here's some timings that've gotten from 5.17.2:
100 column families: 10.7 seconds 200 column families: 40.5 seconds 400 column families: 2 minutes 42 seconds
Steps to reproduce the behavior
I've created a simple Maven project that lets one reproduce the problem easily: https://github.com/grove/bugs-and-regressions/tree/master/rocksdb/column-families-create-drop
Clone the repository and run "mvn clean test" in the rocksdb/column-families-create-drop directory. This will run the unit test with RocksDB 5.17.2, and it will take around 40 seconds to complete (creates 200 column families, adds one key/value pair, looks it up again and then drops the column families).
If you change the version number in the pom.xml to be 4.1.0, then the test will complete in around 3 seconds.
Another thing I've observed is that if you create and then drop a column family, i.e and do not create others in the meantime, then it is fast.
We've got a similar problem - we're creating thousands of column families, and the performance is simply abysmal. We're having to look at removing support for column families, as it looks like they're simply not fit for purpose.
@grove, that 10.7 seconds for 100 column families sounds about right, and it was not this slow before indeed.
I have similar problems with RocksDB 6.14.6 on a Windows 10 machine: creating 19 column families takes almost a second (800 ms). Looking at a profiler output it seems that most time is spent in rocksdb::DBImpl::WriteOptionsFile
, which also seem to reparse (?) the options file, and horrible times are spent around there. So the actual column family creation part still seem to be fast, here is a quick breakdown of the main elements of the pie chart:
||||||||||||||||||| + rocksdb::DB::Open 100%
|||||||||||||||||||| + rocksdb::DBImpl::Open 100%
||||||||||||||||||||| + rocksdb::DBImpl::CreateColumnFamily 86% of total 100%
|||||||||||||||||||||| - rocksdb::DBImpl::WriteOptionsFile 77% of total 100%
|||||||||||||||||||||| - rocksdb::DBImpl::CreateColumnFamilyImpl 8% of total 100%
Checking the size of the created temporary file OPTIONS-000005.dbtmp
yields 9159 bytes. Checking this should not be this slow.
It would be really great if someone could look at this in more detail, because it is becoming a very burning issue on my end as well. Thanks.
I will take a look.
I will take a look.
Thanks. If you solve this, I owe you a beer. Or two.
I believe the performance degradation started when validation of the Options file was introduced. The performance problem is that every time the Options file is written, it is read back in and validated. This validation is done by re-creating the ColumnFamilyOptions and comparing them to the expected ones. This operation is what is taking the time. Note that every ColumnFamily is validated on every operation -- not just the ones added or removed.
PR #9756 made some performance improvements to the re-creating of the ColumnFamilyOptions (25% faster?). If "paranoid_checks" is set to false, the Options file is not validated, which also will save a great deal of time (at the expense of other validations as well).
Performing ColumnFamily additions in bulk -- rather than individually -- is faster if that is an option for you because there are fewer validations required.
There may be other optimizations that are possible, but are more complex/involved. For example, only "deeply" validate options for column families that have changed (rather than for every one). Or compare the existing options file to the new one and only validate the column families that have changed. A new option could also be introduced to skip the options file validation while allowing other validations to occur.
@mrambacher, thanks for the update. I am wondering: does the above, what you just described, also influence DB Open times? So is it possible that e.g. you have rocksdb_options_set_create_if_missing
and rocksdb_options_set_create_missing_column_families
flags set to true, all your column families are already created, and you call rocksdb_open_column_families
-- yet parts of this Options file validation mechanism kicks in again more than once slowing down DB Open?
Just a quick remark: rocksdb::LoadLatestOptions
seems to be also affected. So if above is solved, then LoadLatestOptions should be also way faster, I'm guessing.
We've recently found a similar regression in our test suite after upgrading rocksdb from 6.4 to 6.29. The cpu time time spent dealing with options (which seems to affect creating/opening the db and altering options) seems to have increased 2.9x times in our test suite.
I've been running into this problem with our test-suite that creates and drops many (100s, a few for every test) of column families to isolate test execution and drops them again after a test completes.
For a "silly fix" one can patch the rocksdb code to not write the config file every time a column family is created or dropped:
diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index c834927bb..babed13b5 100644
--- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -2698,8 +2698,8 @@ Status DBImpl::CreateColumnFamily(const ColumnFamilyOptions& cf_options,
assert(handle != nullptr);
Status s = CreateColumnFamilyImpl(cf_options, column_family, handle);
if (s.ok()) {
- s = WriteOptionsFile(true /*need_mutex_lock*/,
- true /*need_enter_write_thread*/);
+ //s = WriteOptionsFile(true /*need_mutex_lock*/,
+ // true /*need_enter_write_thread*/);
}
return s;
}
@@ -2852,8 +2852,8 @@ Status DBImpl::DropColumnFamily(ColumnFamilyHandle* column_family) {
assert(column_family != nullptr);
Status s = DropColumnFamilyImpl(column_family);
if (s.ok()) {
- s = WriteOptionsFile(true /*need_mutex_lock*/,
- true /*need_enter_write_thread*/);
+ //s = WriteOptionsFile(true /*need_mutex_lock*/,
+ // true /*need_enter_write_thread*/);
}
return s;
}
In my case this reduces the test harness runtime from tens of minutes down to 30s -- but it's obviously not a great fix for anything that's not running tests and relies on the option file being there for the future...
As for the cause, I've been profiling this a little with perf, and it seems parsing the config file is very expensive. When I run the test-suite the main time of it is spent parsing and verifying the config file.
I also tried to just disable the verification of the config file by uncommenting the VerifyRocksDBOptionsFromFile
calls, but that's not enough (it's still very slow). Here is a corresponding perf flamegraph that still writes and reads the config file but doesn't verify:
I'm not familiar enough with the rocksdb code-base to identify and fix the root cause. If anyone can maybe point me in the right direction, I'd be happy to give it another try though.
Does this problem have been solved?
Does this problem have been solved?
No adding and removing column families is still very slow in rocksdb.
see also: https://github.com/facebook/rocksdb/pull/11920/files#diff-3778fac04ebd470104a6731fffe0d2dcf93950a23b20297165a35643fe9c3d2bR365