Improve slot migration speed and resource consumption using raw key values
Motivation
Currently, kvrocks performs slot migration by converting the data into RESP commands and then sending them to the destination (#430). We made some small optimizations in #904 to speed up migration.
However, there are still some problems with command-based slot migration:
- The expire time is evaluated during data migration, which makes the data inconsistent (mentioned in the comments of #906).
- For data with an expiration time, if the data expires during the migration, it cannot be migrated and the entire migration fails.
Solution
We can use rawkv-based data migration, where we send RocksDB's raw key-value data directly to the target.
With rawkv-based migration, neither the source nor the destination needs to judge the expire time, which completely solves the data inconsistency problem. It is equivalent to copying the raw key-value data from one instance to another.
It also saves a lot of CPU by eliminating the need to convert key-values into RESP.
In my tests, it can:
- cut CPU usage on the target side by up to 2x
- for small values (100 bytes), run 2.75 times faster than command-based migration
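To make the difference between the two approaches concrete, here is a minimal sketch in Python. It is not kvrocks code: `encode_raw_kvs` and `decode_raw_kvs` use a hypothetical length-prefixed framing invented for illustration, and only the RESP array encoding follows the real protocol. The point is that the command-based path has to build a command for every key, while the raw path copies bytes without interpreting them.

```python
import struct


def encode_resp_command(*args: bytes) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        out.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(out)


def encode_raw_kvs(pairs):
    """Frame raw key-values with 4-byte big-endian length prefixes.

    Hypothetical framing for illustration only; the bytes of each key and
    value are copied as-is and never interpreted.
    """
    out = []
    for key, value in pairs:
        out.append(struct.pack(">I", len(key)) + key)
        out.append(struct.pack(">I", len(value)) + value)
    return b"".join(out)


def decode_raw_kvs(buf: bytes):
    """Inverse of encode_raw_kvs: recover the (key, value) pairs."""
    pairs = []
    pos = 0
    while pos < len(buf):
        (klen,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        key = buf[pos:pos + klen]
        pos += klen
        (vlen,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        value = buf[pos:pos + vlen]
        pos += vlen
        pairs.append((key, value))
    return pairs


# Command-based path: one RESP command per key.
resp = encode_resp_command(b"SET", b"k", b"v")

# Raw path: stored bytes are shipped untouched and round-trip exactly.
raw_pairs = [(b"k1", b"\x00raw-bytes"), (b"k2", b"")]
assert decode_raw_kvs(encode_raw_kvs(raw_pairs)) == raw_pairs
```

In this sketch the raw encoder never looks inside a value, which is why neither side has to reason about expiration while the data is in flight.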
Plan
I decomposed this issue into the following tasks:
- [x] #1982
- [x] #1989
- [x] #2007
- [x] #2008
- [ ] #2009
@caipengbo Do you mean that we can send the write batch directly instead of the RESP format?
@git-hulk Yup, we send the write batch directly, and the target calls the rocksdb::Write() interface directly
Got it, thanks a lot
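The exchange above can be sketched roughly as follows. This is a toy model in Python, not RocksDB's API or its on-wire batch format: `ToyWriteBatch`, `apply_batch`, and the byte layout are all hypothetical stand-ins, and a plain `dict` plays the role of the target's storage. The source serializes a batch of raw operations, and the target decodes and applies it in one step without turning anything into commands.

```python
import struct

PUT, DELETE = 0, 1


class ToyWriteBatch:
    """Toy stand-in for a write batch (hypothetical layout, not RocksDB's)."""

    def __init__(self):
        self._ops = []

    def put(self, key: bytes, value: bytes) -> None:
        self._ops.append((PUT, key, value))

    def delete(self, key: bytes) -> None:
        self._ops.append((DELETE, key, b""))

    def serialize(self) -> bytes:
        # Header: operation count. Each op: type, key length, value length, bytes.
        out = [struct.pack(">I", len(self._ops))]
        for op, key, value in self._ops:
            out.append(struct.pack(">BII", op, len(key), len(value)) + key + value)
        return b"".join(out)


def apply_batch(store: dict, payload: bytes) -> int:
    """Decode the whole batch first, then apply it, so a bad payload changes nothing."""
    (count,) = struct.unpack_from(">I", payload, 0)
    pos = 4
    staged = []
    for _ in range(count):
        op, klen, vlen = struct.unpack_from(">BII", payload, pos)
        pos += 9
        if op not in (PUT, DELETE):
            raise ValueError("unknown operation type")
        key = payload[pos:pos + klen]
        pos += klen
        value = payload[pos:pos + vlen]
        pos += vlen
        staged.append((op, key, value))
    if pos != len(payload):
        raise ValueError("truncated or trailing bytes in batch")
    for op, key, value in staged:
        if op == PUT:
            store[key] = value
        else:
            store.pop(key, None)
    return count


# Source side: collect raw operations and serialize them.
batch = ToyWriteBatch()
batch.put(b"k1", b"v1")
batch.put(b"k2", b"v2")
batch.delete(b"k1")
payload = batch.serialize()

# Target side: apply the received bytes directly to local storage.
target_store = {}
applied = apply_batch(target_store, payload)
```

Decoding before applying is a design choice of this sketch: it keeps a malformed payload from leaving the target half-updated.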
I'm interested in this issue, but I'm confused about this statement:
> With rawkv-based migration, neither the source nor the destination needs to judge the expire time
Does it mean that for a key-value pair with an expire time (k,v,expire), we just send (k,v) and ignore the expire?
Yes, it ignores the expire. Let the target determine the expire.
> Let the target determine the expire.

Do you mean a case like this?
- Migrate (k,v) to the target; at this point (k,v) has no expire time.
- The target then receives a command like `EXPIRE k 10`, and (k,v) now has a new expire time.
Yes
@caipengbo Can you submit an issue to track this?
Yeah, I plan to solve this issue after the slot batch PR is merged @git-hulk
@caipengbo Thanks a lot
@caipengbo PR #1534 has been pending for a while. Besides the performance benefit, this feature is valuable because it lets cluster migration avoid depending on the write batch's log data. I'm very eager to see this feature done, but #1534 has too many conflicts and is a bit hard to review for now.
So I'm not sure if you're willing to continue working on this feature. If yes, I think we can break down this feature into a few PRs and add more tests like:
- [ ] Add the `ApplyBatch` command
- [ ] Implement the merge iterator for iterating SST keys
- [ ] Implement WAL iterator for migrating by WAL logs
- [ ] Implement the migration by applying raw batch
- [ ] Make the raw batch migration parameters configurable
Indeed, the previous code was too much for a single code review, and I'm happy to go ahead and split it up into multiple PRs. @git-hulk
@caipengbo Thank you!
I'm going to restart this work. I tracked some tasks in this issue, and I may change the task names in the future PRs.
Cool, you can also convert those sub-tasks into tracking issues.