kvrocks-controller

Improve the migration API

Open PragmaTwice opened this issue 1 year ago • 4 comments

The input parameter should be a list of slot ranges, e.g. slotRanges: ["1", "2-8", "11-22"], instead of a single slot.

Also, we need to document what the slotOnly parameter does.
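To make the proposed slotRanges format concrete, here is a minimal sketch of how such strings could be parsed and validated. The SlotRange type and the ParseSlotRange / ParseSlotRanges helpers are hypothetical names for illustration, not the controller's actual API; the only outside fact assumed is the 16384-slot (0-16383) keyspace used by Redis-compatible clusters.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// MaxSlotID is the highest slot in a 16384-slot cluster (slots 0-16383).
const MaxSlotID = 16383

// SlotRange is a hypothetical type for this sketch; Start and Stop are inclusive.
type SlotRange struct {
	Start int
	Stop  int
}

// ParseSlotRange parses a single slot ("1") or a range ("2-8").
func ParseSlotRange(s string) (SlotRange, error) {
	startStr, stopStr, isRange := strings.Cut(strings.TrimSpace(s), "-")
	if !isRange {
		// A single slot is treated as a range of length one.
		stopStr = startStr
	}
	start, err := strconv.Atoi(startStr)
	if err != nil {
		return SlotRange{}, fmt.Errorf("invalid slot range %q: %w", s, err)
	}
	stop, err := strconv.Atoi(stopStr)
	if err != nil {
		return SlotRange{}, fmt.Errorf("invalid slot range %q: %w", s, err)
	}
	if start < 0 || stop > MaxSlotID || start > stop {
		return SlotRange{}, fmt.Errorf("slot range %q is reversed or out of bounds", s)
	}
	return SlotRange{Start: start, Stop: stop}, nil
}

// ParseSlotRanges parses every entry and fails on the first invalid one.
func ParseSlotRanges(inputs []string) ([]SlotRange, error) {
	ranges := make([]SlotRange, 0, len(inputs))
	for _, in := range inputs {
		r, err := ParseSlotRange(in)
		if err != nil {
			return nil, err
		}
		ranges = append(ranges, r)
	}
	return ranges, nil
}

func main() {
	ranges, err := ParseSlotRanges([]string{"1", "2-8", "11-22"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ranges)
}
```

This sketch does not check for overlapping ranges; whether overlaps should be rejected or merged is a separate design decision.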

PragmaTwice, Nov 12 '24 05:11

Waiting for an update on this.

zhongfengchi, Apr 10 '25 09:04

Background

Hi, my previous PR #304 added support for migrating a single slot range, for example ["2-8"], or a single slot, ["1"].

However, I'd like to be able to queue up several slot migrations, as in your original example ["1", "2-8", "11-22"], which my PR did not support.

Goal

Trying to see how much effort it'd be to implement this functionality.

Thoughts

Please correct me if anything I'm saying below is wrong!

Current Implementation

When we do a slot migration, this is what happens:

  1. Handler receives a Migrate Slot request
  2. We call cluster.MigrateSlot, which does some checks but essentially issues the command to kvrocks to start migrating the slot (or range).
  3. The request execution ends
  4. ClusterChecker, which runs on its own goroutine, checks each shard for a migration in progress. If the migration has succeeded, it updates the migration status of the shard and also the store (etcd, consul ...)
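The four steps above can be sketched roughly as follows. This is a simplified model, not the controller's code: fakeNode, shard, migrateSlot, checkShard and the map used as the store are all hypothetical stand-ins, and the checker is driven by a plain loop rather than its own goroutine so the example stays deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeNode stands in for a kvrocks node. It reports the migration as
// finished once it has been polled pollsLeft times.
type fakeNode struct {
	migrating string
	pollsLeft int
}

func (n *fakeNode) startMigration(slotRange string) {
	n.migrating = slotRange
}

// migrationDone consumes one poll and reports whether the migration finished.
func (n *fakeNode) migrationDone() bool {
	if n.pollsLeft > 0 {
		n.pollsLeft--
	}
	return n.pollsLeft == 0
}

// shard tracks which slot range (if any) is currently migrating.
type shard struct {
	node          *fakeNode
	migratingSlot string
}

var errAlreadyMigrating = errors.New("shard is already migrating")

// migrateSlot models steps 1-3: run a check, issue the command, return
// immediately without waiting for the migration to finish.
func migrateSlot(sh *shard, slotRange string) error {
	if sh.migratingSlot != "" {
		return errAlreadyMigrating
	}
	sh.node.startMigration(slotRange)
	sh.migratingSlot = slotRange
	return nil
}

// checkShard models one pass of step 4: if the migration succeeded, record
// it in the store and clear the shard's migration status.
func checkShard(sh *shard, store map[string]string) bool {
	if sh.migratingSlot == "" || !sh.node.migrationDone() {
		return false
	}
	store["last_migrated"] = sh.migratingSlot
	sh.migratingSlot = ""
	return true
}

func main() {
	sh := &shard{node: &fakeNode{pollsLeft: 3}}
	store := map[string]string{} // stand-in for etcd, consul, ...
	if err := migrateSlot(sh, "2-8"); err != nil {
		fmt.Println("error:", err)
		return
	}
	passes := 1
	for !checkShard(sh, store) {
		passes++
	}
	fmt.Printf("migration of %s recorded after %d checker passes\n", store["last_migrated"], passes)
}
```

The point of the model is that the request handler and the completion check are decoupled, which is why queueing more ranges needs state that outlives the request.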

What I think needs to happen

  1. If we want to support multiple slot ranges, we'd need to make use of the store to save which slot ranges need to migrate.
  2. Have the controller somehow trigger the next migration when the previous one ends with "success"
  3. Support cancelling the migration (this will just drop what's queued up next, but not stop what's currently migrating)
  4. Support reconnecting and reading from the store and continuing the migration
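As a rough illustration of points 2 and 3, here is a minimal in-memory sketch of the queue behaviour. MigrationQueue and its methods are hypothetical names; persistence to the store and reconnecting (points 1 and 4) are deliberately left out, and the start callback stands in for whatever actually issues the migration to kvrocks.

```go
package main

import "fmt"

// MigrationQueue is a hypothetical structure for this sketch.
type MigrationQueue struct {
	Current string   // slot range currently migrating, "" when idle
	Pending []string // slot ranges waiting their turn, in order
}

// StartNext moves the first pending range into Current and starts it.
// It does nothing if a migration is already running or nothing is pending.
func (q *MigrationQueue) StartNext(start func(slotRange string) error) error {
	if q.Current != "" || len(q.Pending) == 0 {
		return nil
	}
	next := q.Pending[0]
	if err := start(next); err != nil {
		return err
	}
	q.Current, q.Pending = next, q.Pending[1:]
	return nil
}

// OnSuccess would be called by the checker once Current has finished
// successfully; it triggers the next queued migration.
func (q *MigrationQueue) OnSuccess(start func(slotRange string) error) error {
	q.Current = ""
	return q.StartNext(start)
}

// Cancel drops everything still pending and returns it. The migration
// that is already running is left untouched.
func (q *MigrationQueue) Cancel() []string {
	dropped := q.Pending
	q.Pending = nil
	return dropped
}

func main() {
	start := func(slotRange string) error {
		fmt.Println("starting migration of", slotRange)
		return nil
	}
	q := &MigrationQueue{Pending: []string{"1", "2-8", "11-22"}}
	if err := q.StartNext(start); err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := q.OnSuccess(start); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cancelled:", q.Cancel(), "still running:", q.Current)
}
```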

Please let me know if there's anything I got wrong or anything that needs to be added. I also haven't started on this yet (or decided to take it on yet), I'm just trying to see how much effort it'd be.

bseto, Jun 18 '25 22:06

@bseto Thanks for the nice write-up; I believe your analysis is correct. One addition regarding the migration queue: we might need to persist the pending slot ranges in the cluster information, so that we can recover them if the leader changes.
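A minimal sketch of what persisting the pending ranges alongside the cluster information could look like. The ClusterState and ShardState types, their JSON field names, and the in-memory memStore are assumptions made for illustration; they are not the controller's real schema or store interface.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ShardState and ClusterState are hypothetical shapes for this sketch.
type ShardState struct {
	MigratingSlot     string   `json:"migrating_slot,omitempty"`
	PendingSlotRanges []string `json:"pending_slot_ranges,omitempty"`
}

type ClusterState struct {
	Name   string       `json:"name"`
	Shards []ShardState `json:"shards"`
}

var errNotFound = errors.New("cluster not found in store")

// memStore stands in for the real store (etcd, consul, ...).
type memStore map[string][]byte

// saveCluster serializes the whole cluster state, queue included.
func saveCluster(st memStore, c *ClusterState) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	st[c.Name] = data
	return nil
}

// loadCluster is what a newly elected leader would call to recover the queue.
func loadCluster(st memStore, name string) (*ClusterState, error) {
	data, ok := st[name]
	if !ok {
		return nil, errNotFound
	}
	var c ClusterState
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	st := memStore{}
	before := &ClusterState{
		Name: "demo",
		Shards: []ShardState{
			{MigratingSlot: "1", PendingSlotRanges: []string{"2-8", "11-22"}},
		},
	}
	if err := saveCluster(st, before); err != nil {
		fmt.Println("error:", err)
		return
	}
	// Simulate a leader change: the new leader only has the store.
	after, err := loadCluster(st, "demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("resuming %q, still pending: %v\n",
		after.Shards[0].MigratingSlot, after.Shards[0].PendingSlotRanges)
}
```

Keeping the queue inside the same record as the rest of the cluster information means it is written and read together with the migration status, so the two cannot drift apart across a leader change.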

git-hulk, Jun 19 '25 14:06

OK, so as long as we put the migration queue into the cluster information, that should be fine.

Some other questions I have:

  1. Suppose the input is ["1", "2-8", "11-22"] and "2-8" is just about to finish. Do we want to allow the user to cancel the migration, replace the migration queue, or even append to it?
  2. What happens if we're halfway through the queue and we get a failure? Do we want to skip the failed range, retry automatically, or make this behaviour configurable?
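To make the second question concrete, here is a sketch of what a configurable failure policy might look like. FailurePolicy, its three values, and RunQueue are hypothetical, and the loop is synchronous purely for illustration; in the controller the equivalent decision would presumably be made by the checker each time a migration finishes.

```go
package main

import (
	"errors"
	"fmt"
)

// FailurePolicy is a hypothetical setting for this sketch.
type FailurePolicy int

const (
	StopOnFailure  FailurePolicy = iota // abandon the rest of the queue
	SkipOnFailure                       // record the failure, move on
	RetryOnFailure                      // retry up to maxRetries, then stop
)

var errMigrationFailed = errors.New("migration failed")

// RunQueue migrates each range in order and applies the policy on failure.
// It returns the ranges that succeeded and the ranges that failed.
func RunQueue(ranges []string, policy FailurePolicy, maxRetries int,
	migrate func(slotRange string) error) (done, failed []string) {
	for _, r := range ranges {
		err := migrate(r)
		for attempt := 0; err != nil && policy == RetryOnFailure && attempt < maxRetries; attempt++ {
			err = migrate(r)
		}
		if err == nil {
			done = append(done, r)
			continue
		}
		failed = append(failed, r)
		if policy != SkipOnFailure {
			// Stop, or retries exhausted: later ranges are not attempted.
			break
		}
	}
	return done, failed
}

func main() {
	ranges := []string{"1", "2-8", "11-22"}
	failMiddle := func(slotRange string) error {
		if slotRange == "2-8" {
			return errMigrationFailed
		}
		return nil
	}
	done, failed := RunQueue(ranges, SkipOnFailure, 0, failMiddle)
	fmt.Println("skip: done", done, "failed", failed)
	done, failed = RunQueue(ranges, StopOnFailure, 0, failMiddle)
	fmt.Println("stop: done", done, "failed", failed)
}
```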

Thanks!

bseto, Jul 16 '25 22:07