Use C++ RocksDB API to generate an SST file of String/Hash type
Hi, I've reviewed #1301 and #2458. Based on my understanding, the requirement of this task is to wrap a RocksDB API that takes a set of String/Hash-type key-value pairs as input and outputs an SST file compliant with the Kvrocks data structure. Is my understanding correct? @git-hulk
@yezhizi Yes, exactly.
May I give this a try?
@raffertyyu Thanks for your interest. I'm not sure if @yezhizi is working on this.
I'm not working on this, feel free to try it. :) @raffertyyu
@yezhizi Thanks for your feedback. Assigned to @raffertyyu.
I'm so sorry that I'm too busy with work to do this right now. :( Please reassign to someone else. @git-hulk Sorry again for being busy.
Sure.
@PragmaTwice are you working on this? If not @git-hulk I can work on this.
@ymiuraaa Sure, you could take it.
@ymiuraaa Curious how you would implement this.
My team would need this as well. But I feel this issue is not clear about the scope.
First, it is not clear to me whether the issue is about a separate tool, a new command, or a new mode. Second, it is not clear whether it would support cluster mode.
Just saw https://github.com/apache/kvrocks/issues/294. Looks like it is referring to a separate tool. But would it be harder to maintain because of the slot sharding logic?
Yes, based on #294, #2458, and #2941, this issue seems to be focused on a separate CLI tool, not a new command or mode within kvrocks itself. From what I’ve seen in the repo, rocksdb::SstFileWriter hasn’t been used yet, so this would be the first integration of external SST generation logic into Kvrocks tooling, right?
Also regarding cluster mode:
I think @zhixinwen has a valid concern. Slot sharding adds complexity because Kvrocks expects keys to be mapped to specific hash slots. So the current plan I have in mind is:
- Format keys according to kvrocks's internal encoding (namespace:type:key, and namespace:hash:key|field).
- Use something like crc16.c from Redis to determine the slot for each key. The crc16.c file is in the unstable branch of Redis, but that doesn’t mean the CRC16 logic itself is unstable or experimental, so it should already be thoroughly tested.
- Initially support single-slot SSTs only, with an optional --force-slot flag.
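For reference, the slot computation in the second bullet could look roughly like the sketch below. It assumes Kvrocks follows the same rule as Redis Cluster (CRC16/XMODEM over the key, or over the hash tag inside `{...}` if one is present, modulo 16384). `Crc16` and `KeySlot` are names I made up for illustration, not existing Kvrocks functions.

```cpp
#include <cstdint>
#include <string>

// CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no bit reflection.
// Bitwise version for clarity; a table-driven version would be faster.
inline uint16_t Crc16(const std::string& data) {
  uint16_t crc = 0;
  for (unsigned char byte : data) {
    crc ^= static_cast<uint16_t>(byte << 8);
    for (int bit = 0; bit < 8; ++bit) {
      const bool carry = (crc & 0x8000) != 0;
      crc = static_cast<uint16_t>(crc << 1);
      if (carry) crc ^= 0x1021;
    }
  }
  return crc;
}

// Assumption: same slot count as Redis Cluster.
constexpr uint16_t kSlotCount = 16384;

// Hypothetical helper. Applies the Redis Cluster hash-tag rule: if the key
// contains '{' followed later by '}' with at least one character in between,
// only that substring is hashed; otherwise the whole key is hashed.
inline uint16_t KeySlot(const std::string& key) {
  const std::string::size_type open = key.find('{');
  if (open != std::string::npos) {
    const std::string::size_type close = key.find('}', open + 1);
    if (close != std::string::npos && close > open + 1) {
      return static_cast<uint16_t>(
          Crc16(key.substr(open + 1, close - open - 1)) % kSlotCount);
    }
  }
  return static_cast<uint16_t>(Crc16(key) % kSlotCount);
}
```

With this, `KeySlot("foo")` gives 12182, and `{user1000}.following` lands in the same slot as `user1000`. Whatever we end up using should be checked against the slot function Kvrocks itself uses before relying on it.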
What do you think about this plan? Also, if the input contains keys from multiple slots, what should we do: reject it or output multiple SSTs?
If the implementation is a separate tool, you may need to abstract out the common data format functions so it can be shared between the tool and KVRocks (similar for sharding logic if you choose to do that).
I think the issue with slots is not a difference in encoding, but how you output the final SSTs so that they can be easily moved into KVRocks instances under cluster mode. For example, you would need to know that slots 1-500 are on node 1, slots 500-1000 are on node 2, etc., and the SST output should be put into different directories. I do not think --force-slot/single-slot-only is needed, but for extensibility, the tool should be able to accept a slot range map in the future.
Hmm true. I agree the main challenge isn’t just key encoding or hashing, but partitioning SSTs according to slot-to-node mapping so each SST can be ingested correctly by a Kvrocks cluster.
So in that case I should probably abstract the key encoding and slot hashing into reusable functions so they can be shared between Kvrocks and this tool. First, we'll start by generating a single SST and logging a warning if keys span multiple slots. Then we can add a --slot-map option (e.g. JSON input like { "0-5000": "shard1", ... , "0-5n": "shard n"}) that outputs one SST per slot range, like #2941, so that each file is ready for ingestion by the correct Kvrocks instance. How does that sound?
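To make the --slot-map idea a bit more concrete, here is a minimal sketch of the lookup side only, assuming the JSON has already been parsed into inclusive (first, last, target) triples. `SlotRangeMap` and its methods are hypothetical names, not anything that exists in Kvrocks today.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical type: maps inclusive, non-overlapping slot ranges to a target
// name (for example a shard or an output directory).
class SlotRangeMap {
 public:
  // Registers [first, last] -> target. Returns false if the range is empty
  // or overlaps a range that was added earlier.
  bool Add(uint16_t first, uint16_t last, std::string target) {
    if (first > last) return false;
    auto next = ranges_.upper_bound(first);  // first range starting after `first`
    if (next != ranges_.end() && next->first <= last) return false;
    if (next != ranges_.begin() && std::prev(next)->second.first >= first) {
      return false;
    }
    ranges_.emplace(first, std::make_pair(last, std::move(target)));
    return true;
  }

  // Returns the target that owns `slot`, or nullptr if no range covers it.
  const std::string* Lookup(uint16_t slot) const {
    auto it = ranges_.upper_bound(slot);
    if (it == ranges_.begin()) return nullptr;
    --it;  // last range whose start is <= slot
    if (slot > it->second.first) return nullptr;
    return &it->second.second;
  }

 private:
  // range start -> (range end, target)
  std::map<uint16_t, std::pair<uint16_t, std::string>> ranges_;
};
```

The tool would call `Lookup` with each key's slot to pick the output SST, and could treat a `nullptr` result as the "key is outside every configured range" case (warn or reject, whichever we decide).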
That would work. Another way to handle this is to generate SST files per slot and have a way to let KVRocks load only the slots it is responsible for.
The advantage of this design is that you can change the number of instances without regenerating the SST files.
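A rough sketch of what I mean, under the assumption that each record already carries its slot and its Kvrocks-encoded key. `Record`, `GroupBySlot`, and `SlotsToLoad` are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical input record: slot is assumed to be computed beforehand.
struct Record {
  uint16_t slot;
  std::string key;    // already encoded in the Kvrocks format
  std::string value;
};

// slot -> (key -> value). The inner std::map keeps keys sorted, which matters
// because rocksdb::SstFileWriter expects keys to be added in ascending order.
using SlotBuckets = std::map<uint16_t, std::map<std::string, std::string>>;

// Generation side: one bucket (and later one SST file) per slot.
inline SlotBuckets GroupBySlot(const std::vector<Record>& records) {
  SlotBuckets buckets;
  for (const Record& r : records) {
    buckets[r.slot][r.key] = r.value;  // a later duplicate key overwrites
  }
  return buckets;
}

// Loading side: given the inclusive slot ranges an instance owns, pick the
// per-slot files it should ingest and ignore the rest.
inline std::vector<uint16_t> SlotsToLoad(
    const SlotBuckets& buckets,
    const std::vector<std::pair<uint16_t, uint16_t>>& owned_ranges) {
  std::vector<uint16_t> result;
  for (const auto& range : owned_ranges) {
    for (auto it = buckets.lower_bound(range.first);
         it != buckets.end() && it->first <= range.second; ++it) {
      result.push_back(it->first);
    }
  }
  return result;
}
```

Since the files are keyed by slot rather than by node, resharding only changes the ranges passed to `SlotsToLoad`; the generated files stay the same.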