
Support partitioning memtables

Open mpoeter opened this issue 3 years ago • 4 comments

From what I found in the documentation, there can be only a single active memtable. Is there a particular reason for this limitation?

I think it could be beneficial to have multiple active memtables, especially when using a prefix_extractor. Suppose we have a fixed size hashmap for the memtables, and the prefix hash determines which memtable to use. That could not only reduce contention on the memtables, but we would end up with memtables with keyspaces with disjoint prefixes, which should improve the utilization of the prefix bloom filter. This information could then also be used during compaction, because we know that there cannot be any overlap between SST files from different buckets.
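The bucketing idea above can be sketched in a few lines (a toy model, not RocksDB code: `PartitionedMemtables`, `NUM_BUCKETS`, and the one-character prefix extractor are all hypothetical, and plain dicts stand in for skip-list memtables):

```python
import hashlib

NUM_BUCKETS = 4  # hypothetical fixed number of active memtables


def prefix_of(key: str) -> str:
    """Stand-in for a prefix_extractor: the first character of the key."""
    return key[:1]


def bucket_for(key: str) -> int:
    """Route a key to a memtable bucket by hashing its prefix."""
    digest = hashlib.sha256(prefix_of(key).encode()).digest()
    return digest[0] % NUM_BUCKETS


class PartitionedMemtables:
    """Toy model: dicts stand in for concurrent skip-list memtables."""

    def __init__(self):
        self.buckets = [dict() for _ in range(NUM_BUCKETS)]

    def put(self, key: str, value: str) -> None:
        self.buckets[bucket_for(key)][key] = value

    def get(self, key: str):
        return self.buckets[bucket_for(key)].get(key)
```

Because all keys sharing a prefix hash to the same bucket, the keyspaces of different buckets are disjoint by prefix, which is what would let flushed files from different buckets be treated as non-overlapping.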

What do you think about such an extension? I would be willing to take a stab at this, but I could need a few pointers where I find the relevant places in the code.

mpoeter avatar Jan 16 '22 15:01 mpoeter

To be more precise, there is one mutable memtable per column family. There can be immutable memtables as well.

We have an existing community-contributed feature, SstPartitioner, that sets boundaries for sorted runs (IIRC to make leveled compaction more efficient on some kinds of data). One could add an option to apply the same to memtables, but it would be a significant, risky change. I could see the overhead of applying partitioning being smaller in many cases than the savings from fewer skip-list search layers, though certainly not in all cases.

reduce contention on the memtables

Our standard memtable is a lock-free concurrent skip list. Contention is not really an issue.

prefix hash determines which memtable to use

Having a disconnect in key order at every prefix transition is bad for iterators and (thus) bad for building sorted runs (flush, compaction). SstPartitioner is intended to capture higher-level divisions than prefixes.

improve the utilization of the prefix bloom filter

A partitioning layer on top of Bloom filter doesn't save anything. Generally it's most efficient to have a filter cover everything with the same end-of-life (e.g. memtable ending on flush, single SST file ending on deletion, etc.).
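A back-of-the-envelope way to see this: to first order, a Bloom filter's false-positive rate depends only on bits per key, so splitting the same keys and the same memory across two filters changes nothing. A sketch using the standard approximation fp ≈ 0.6185^(bits/key):

```python
import math


def bloom_fp_rate(bits_per_key: float) -> float:
    """Approximate false-positive rate of an optimally-hashed Bloom
    filter as a function of bits per key: (0.6185)^(bits/key)."""
    return 0.6185 ** bits_per_key


# One filter over 2n keys with 2m bits has the same bits-per-key...
one_filter = bloom_fp_rate((2 * 1_000_000 * 10) / (2 * 1_000_000))
# ...as two filters over n keys with m bits each, so partitioning the
# keyspace across filters saves nothing by itself.
two_filters = bloom_fp_rate((1_000_000 * 10) / 1_000_000)
assert math.isclose(one_filter, two_filters)
```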

pdillinger avatar Jan 17 '22 19:01 pdillinger

To be more precise, there is one mutable memtable per column family.

Yes, by "active" I actually meant "mutable"; I just couldn't think of the correct term. :)

Our standard memtable is a lock-free concurrent skip list. Contention is not really an issue.

I know, but lock-free data structures can also suffer from contention on frequently changed memory locations. To be fair, this never showed up in any of my profiling runs (the key comparison was much more dominant).

Having a disconnect in key order at every prefix transition is bad for iterators and (thus) bad for building sorted runs (flush, compaction).

I am not sure if I understand you correctly, but let me provide a very simple example. Suppose we write the following key/value pairs in this order: (a, 1), (b, 1), (c, 1), (d, 1), (a, 2), (b, 2), (c, 2), (d, 2)

Now suppose that we flush each memtable after 4 elements, so we would end up with these two files:

file1: (a, 1), (b, 1), (c, 1), (d, 1)
file2: (a, 2), (b, 2), (c, 2), (d, 2)

If instead we had two mutable memtables and selected the memtable based on the prefix, we would end up with something like this ([a, c] => memtable1, [b, d] => memtable2):

file1: (a, 1), (a, 2), (c, 1), (c, 2)
file2: (b, 1), (b, 2), (d, 1), (d, 2)

Each file contains only two prefixes instead of four (which is why I think the prefix bloom filter could benefit from this grouping), and a lookup for any of the keys should be more efficient since fewer files have to be checked.

Obviously this example is highly simplified and contrived, but I hope it can convey the general idea.
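The two flush strategies in this example can be simulated in a few lines (a toy model: lists stand in for memtables, each write is kept as a separate entry the way a memtable does with sequence numbers, and the [a, c] / [b, d] routing is the mapping from the example, not a real hash):

```python
def flush_single(writes, flush_size=4):
    """Single memtable: emit a file every `flush_size` writes, in arrival order."""
    return [writes[i:i + flush_size] for i in range(0, len(writes), flush_size)]


# Routing table from the example: prefixes a and c go to memtable 0,
# prefixes b and d go to memtable 1.
ROUTE = {"a": 0, "c": 0, "b": 1, "d": 1}


def flush_partitioned(writes, flush_size=4):
    """Two mutable memtables selected by prefix; each flushes a sorted
    file once it holds `flush_size` entries."""
    buckets = {0: [], 1: []}
    files = []
    for key, value in writes:
        b = ROUTE[key[0]]
        buckets[b].append((key, value))
        if len(buckets[b]) == flush_size:
            files.append(sorted(buckets[b]))
            buckets[b] = []
    for leftover in buckets.values():  # flush whatever remains
        if leftover:
            files.append(sorted(leftover))
    return files


writes = [("a", 1), ("b", 1), ("c", 1), ("d", 1),
          ("a", 2), ("b", 2), ("c", 2), ("d", 2)]
```

Running both strategies on `writes` reproduces the two layouts above: the single memtable yields one file per "round" of prefixes, while the partitioned variant yields one file per prefix group.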

mpoeter avatar Jan 18 '22 11:01 mpoeter

In your contrived example you assume essentially twice as much memory devoted to memtables in the second case as in the first case. If you simply devote the same amount of memory to memtables in the first case, it essentially fixes the problem, and with one L0 file instead of two L0 files, which is better for read amplification and write amplification (compaction triggering).

pdillinger avatar Jan 20 '22 06:01 pdillinger

Yes, for this particular example that would be true. But if different prefixes occur with different frequencies, this is no longer the case. Consider for example that we write twelve keys with prefix a and four with prefix b, interleaved so that every fourth write has prefix b. If we flush every four elements we would get:

file1: a01, a02, a03, b01
file2: a04, a05, a06, b02
file3: a07, a08, a09, b03
file4: a10, a11, a12, b04

If, for fairness with respect to memory consumption, we double the size of the single memtable we would end up with:

file1: a01, a02, a03, b01, a04, a05, a06, b02
file2: a07, a08, a09, b03, a10, a11, a12, b04

With this partitioning we would get:

file1: a01, a02, a03, a04
file2: a05, a06, a07, a08
file3: a09, a10, a11, a12
file4: b01, b02, b03, b04
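To make the prefix-filter argument concrete, here is a small check over the three layouts above, counting how many files a prefix filter cannot rule out for a given prefix (a toy model with the files written out literally; a real prefix bloom filter would also have false positives):

```python
# The three file layouts from the example (keys only, as listed above).
single_small = [["a01", "a02", "a03", "b01"], ["a04", "a05", "a06", "b02"],
                ["a07", "a08", "a09", "b03"], ["a10", "a11", "a12", "b04"]]
single_double = [["a01", "a02", "a03", "b01", "a04", "a05", "a06", "b02"],
                 ["a07", "a08", "a09", "b03", "a10", "a11", "a12", "b04"]]
partitioned = [["a01", "a02", "a03", "a04"], ["a05", "a06", "a07", "a08"],
               ["a09", "a10", "a11", "a12"], ["b01", "b02", "b03", "b04"]]


def prefixes_per_file(files):
    """Number of distinct one-character prefixes stored in each file."""
    return [len({key[0] for key in f}) for f in files]


def files_with_prefix(files, prefix):
    """Files an ideal prefix filter cannot rule out for `prefix`."""
    return sum(1 for f in files if any(key[0] == prefix for key in f))
```

With the partitioned layout every file covers a single prefix, so a lookup for a b-key touches one file; with the doubled single memtable both files contain b-keys, so both must be checked.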

I agree that having fewer files might still be better, but I would like to at least give this a try.

mpoeter avatar Jan 20 '22 08:01 mpoeter