python-rocksdb icon indicating copy to clipboard operation
python-rocksdb copied to clipboard

VectorMemtableFactory broken on 5.3.6

Open EliFinkelshteyn opened this issue 7 years ago • 9 comments

On 5.3.6, opts.memtable_factory = rocksdb.VectorMemtableFactory() doesn't seem to work. It gives the error: Invalid argument: Memtable doesn't concurrent writes (allow_concurrent_memtable_write). But allow_concurrent_memtable_write doesn't seem to be exposed in the API, so it can't be changed.

EliFinkelshteyn avatar Jun 17 '17 22:06 EliFinkelshteyn

@EliFinkelshteyn

I have not tested the code on 5.3.6, but it's ok on 5.4.6 .

Here is my test code

import rocksdb
opts = rocksdb.Options()
opts.memtable_factory = rocksdb.VectorMemtableFactory()

twmht avatar Jun 18 '17 05:06 twmht

I've confirmed this does not work on either 5.3.6 or 5.4.6. Your test code works fine, but if you then attempt to create a db, I'm guessing you'll see the same error I got.

import rocksdb
opts = rocksdb.Options()
opts.memtable_factory = rocksdb.VectorMemtableFactory()
opts.create_if_missing = True
test_db = rocksdb.DB("my_db", opts)

gives

InvalidArgument: Invalid argument: Memtable doesn't concurrent writes (allow_concurrent_memtable_write)

EliFinkelshteyn avatar Jun 18 '17 22:06 EliFinkelshteyn

@EliFinkelshteyn

got it!

this is the bug ( it should not throw such exceptions).

And it seems that allow_concurrent_memtable_write is only implemented in SkipListFactory,

so the following code woks.

import rocksdb
opts = rocksdb.Options()
opts.memtable_factory = rocksdb.SkipListMemtableFactory()
opts.create_if_missing = True
test_db = rocksdb.DB("my_db", opts)

I will fix this in the next release.

thanks

twmht avatar Jun 19 '17 03:06 twmht

@EliFinkelshteyn

I have submited a fix patch to branch dev-issue9

you can try to test whether this can fix your problem.

git clone https://github.com/twmht/python-rocksdb.git -b dev-issue9
cd python-rocksdb
python setup.py install

the test code

opts = rocksdb.Options()
opts.allow_concurrent_memtable_write = False
opts.memtable_factory = rocksdb.VectorMemtableFactory()
opts.create_if_missing = True
test_db = rocksdb.DB("/tmp/test", opts)

thanks

twmht avatar Jun 19 '17 06:06 twmht

I can confirm this fixes the issue. Thanks!

Unfortunately, it doesn't seem to change timings at all in my initial bulk data load as this implies should happen, but I guess YMMV.

Anyway, thanks for the quick fix and for taking over maintenance of this library! One last comment is that it might be worthwhile to change the error message InvalidArgument: Invalid argument: Memtable doesn't concurrent writes (allow_concurrent_memtable_write) to something more explicit like the allow_concurrent_memtable_write option must be false to use your desired memtable_factory to give guidance to anyone running into this in the future.

EliFinkelshteyn avatar Jun 19 '17 09:06 EliFinkelshteyn

@EliFinkelshteyn

thanks! there are more features are coming.

if this patch fixed the problem, please close this issue.

I will look into the other problem you point out, but if you have other problems, please open a new issue for that.

twmht avatar Jun 19 '17 11:06 twmht

Happy to. Figured you'd want to do it with the patch merge, but don't mind doing it here. Thanks again for the quick fix!

EliFinkelshteyn avatar Jun 19 '17 22:06 EliFinkelshteyn

@EliFinkelshteyn

One more question, what does this mean?

it doesn't seem to change timings at all in my initial bulk data load

You may show some expected results including the test code, so that I can look into that.

thanks!

twmht avatar Jun 20 '17 01:06 twmht

Sure, what I mean is the tutorial docs state:

For initial bulk loads the Vector-MemtableFactory makes sense.

That led me to believe I'd see faster times for a large initial bulk load by using Vector-MemtableFactory, but in fact I got identical times using this. Some (very simple) test code:

import rocksdb
import ujson
import random
import string
import time


def generate_ascii_str(length=10):
    return ''.join(random.choice(string.letters + string.digits)
                   for _ in range(length))

dummy_datas = list()
for i in range(1000000):
    dummy_data = \
        {"datum1": "this is a data/blah blah blah/" +
                   generate_ascii_str(10),
         "datum2": "this is a other data/blah blah blah/" +
                   generate_ascii_str(10),
         "description": "this is a description/blah blah blah/" +
                        generate_ascii_str(10)}
    dummy_data_str = ujson.dumps(dummy_data).decode('utf8')
    dummy_datas.append(dummy_data_str)

opts = rocksdb.Options()
# set both of these so we don't keep any logs (we never need them)
opts.wal_ttl_seconds = 0
opts.wal_size_limit_mb = 0
opts.db_log_dir = "LOGS"
opts.compression = rocksdb.CompressionType.zlib_compression
opts.write_buffer_size = 2400000000
opts.allow_concurrent_memtable_write = False
opts.memtable_factory = rocksdb.VectorMemtableFactory()
opts.create_if_missing = True
db = rocksdb.DB("/tmp/test_rocks_db", opts)
begin_time = time.time()
to_write = list()
batch = rocksdb.WriteBatch()
for i, item in enumerate(dummy_datas):
    db.put(bytes(i), bytes(item))
db.write(batch, disable_wal=True)
end_time = time.time()
print "rocksdb creation took:"
print end_time - begin_time

EliFinkelshteyn avatar Jun 21 '17 06:06 EliFinkelshteyn