kvrocks Improve slot migration speed and resource consumption using raw key values

Motivation

Currently, kvrocks performs slot migration by converting the data into RESP and then sending it to the destination(#430). We made a little optimizations in #904 to speed up migration.

However, there are still some problems with command-based slot migration:

we will determine the expire time during the data migration, which makes the data inconsistent (mentioned in the comments of #906)
for data with an expiration time, if the expiration occurs during the migration process, then we cannot migrate the data and the entire migration process will fail.

Solution

We can use rawkv-based data migration, where we directly send rocksdb's key values data to the target.

With rawkv-based migration, neither the source nor the destination need to judge expire time, which completely solves the data inconsistency problem. It is equivalent to translating the raw key values data from one instance to another.

It can save a lot of CPU by eliminating the need to convert Key-Value to the RESP.

In my tests, it can:

save up to 2x CPU on the target side
for small values(100byte), the rawkv-based migration is 2.75 times faster than the command-based migration

Plan

I decomposed this issues into those tasks:

[x] #1982
[x] #1989
[x] #2007
[x] #2008
[ ] #2009

Jan 11 '23 02:01 caipengbo

@caipengbo Do you mean that we can send the write batch directly instead of the RESP format？

Jan 11 '23 02:01 git-hulk

Do you mean that we can send the write batch directly instead of the RESP format？

@git-hulk Yup, we send the write batch directly, and the target calls the rocksdb::Write() interface directly

Jan 11 '23 03:01 caipengbo

Do you mean that we can send the write batch directly instead of the RESP format？

@git-hulk Yup, we send the write batch directly, and the target calls the rocksdb::Write() interface directly

Got it, thanks a lot

Jan 11 '23 03:01 git-hulk

I'm interested in this issue, but I'm confused about

With rawkv-based migration, neither the source nor the destination need to judge expire time

Do you mean that we can send the write batch directly instead of the RESP format？

Does it means that: for a key-value pair with expire (k,v,expire), just send the (k,v) and ignore the expire?

Apr 28 '23 12:04 ZENOTME

Does it means that: for a key-value pair with expire (k,v,expire), just send the (k,v) and ignore the expire?

Yes, it ignore the expire. Let the target determine the expire.

Apr 28 '23 12:04 caipengbo

Let the target determine the expire.

Do you means the case like:

migrate (k,v) to target, and now (k,v) don't have expire
And then target get a command like 'EXPIRE k 10', now (k,v) have a new expire.

Apr 28 '23 12:04 ZENOTME

Do you means the case like:

Yes

May 09 '23 02:05 caipengbo

@caipengbo Can you submit an issue to track this?

May 09 '23 08:05 git-hulk

@caipengbo Can you submit an issue to track this?

Yeah, I plan to solve this issue after slot batch PR merged @git-hulk

May 09 '23 10:05 caipengbo

@caipengbo Thanks a lot

May 09 '23 13:05 git-hulk

@caipengbo PR #1534 has been pending for a while. This feature is valuable since it can help the cluster migration avoid depending on the log data of the write batch except for the performance benefit. I'm very eager to see this feature done, but #1534 has too many conflicts and it's a bit hard to review for now.

So I'm not sure if you're willing to continue working on this feature. If yes, I think we can break down this feature into a few PRs and add more tests like:

[ ] Add the ApplyBatch command
[ ] Implement the merge iterator for iterating SST keys
[ ] Implement WAL iterator for migrating by WAL logs
[ ] Implement the migration by applying raw batch
[ ] Make the raw batch migration parameters configurable

Jan 03 '24 12:01 git-hulk

So I'm not sure if you're willing to continue working on this feature. If yes, I think we can break down this feature into a few PRs and add more tests.

Indeed, the previous code was too much for CR, and I was happy to go ahead and split it up into multiple PRs. @git-hulk

Jan 03 '24 12:01 caipengbo

Indeed, the previous code was too much for CR, and I was happy to go ahead and split it up into multiple PRs. @git-hulk

@caipengbo Thank you!

Jan 03 '24 12:01 git-hulk

I'm going to restart this work. I tracked some tasks in this issue, and I may change the task names in the future PRs.

Jan 03 '24 13:01 caipengbo

I'm going to restart this work. I tracked some tasks in this issue, and I may change the task names in the future PRs.

cool, can also convert those sub-tasks into tracking issues.

Jan 03 '24 13:01 git-hulk