kvrocks
feat: support migrate slot range[draft]
issues: https://github.com/apache/kvrocks/issues/2355
This draft PR demonstrates how to support migrating slot ranges.
What I Did
1 Migration job - 1 slot range:
- I encapsulated a `SlotRange` structure and changed the migration-related class members from a single slot to a slot range.
- Reference: https://github.com/apache/kvrocks/issues/412. Slot migration includes the following phases: start migration, migrate existing data, migrate incremental data, and end migration. In each modified phase, the entire slot range must be completed before moving to the next phase.
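To make the range semantics concrete for discussion, here is a minimal Go sketch of an inclusive slot range that accepts the two argument forms used in the tests (`"112"` and `"112-113"`). It is an illustration only and not the structure in this PR: the names `ParseSlotRange`, `Contains`, and `Overlaps` are hypothetical, and the slot limit of 16383 assumes the usual 16384-slot cluster layout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// MaxSlotID assumes a 16384-slot cluster (slots 0 through 16383).
const MaxSlotID = 16383

// SlotRange is a hypothetical inclusive range [Start, End] of hash slots.
type SlotRange struct {
	Start int
	End   int
}

// ParseSlotRange accepts "N" (a single slot) or "N-M" (an inclusive range).
func ParseSlotRange(s string) (SlotRange, error) {
	first, last, isRange := strings.Cut(s, "-")
	if !isRange {
		last = first // a single slot is a range of length one
	}
	start, err := strconv.Atoi(first)
	if err != nil {
		return SlotRange{}, fmt.Errorf("invalid start slot %q: %w", first, err)
	}
	end, err := strconv.Atoi(last)
	if err != nil {
		return SlotRange{}, fmt.Errorf("invalid end slot %q: %w", last, err)
	}
	if start < 0 || end > MaxSlotID || start > end {
		return SlotRange{}, fmt.Errorf("slot range %q is reversed or outside 0-%d", s, MaxSlotID)
	}
	return SlotRange{Start: start, End: end}, nil
}

// Contains reports whether slot lies inside the range.
func (r SlotRange) Contains(slot int) bool {
	return r.Start <= slot && slot <= r.End
}

// Overlaps reports whether the two ranges share at least one slot.
func (r SlotRange) Overlaps(o SlotRange) bool {
	return r.Start <= o.End && o.Start <= r.End
}

// String renders the range in the same form ParseSlotRange accepts.
func (r SlotRange) String() string {
	if r.Start == r.End {
		return strconv.Itoa(r.Start)
	}
	return fmt.Sprintf("%d-%d", r.Start, r.End)
}

func main() {
	r, err := ParseSlotRange("114-116")
	if err != nil {
		panic(err)
	}
	other, _ := ParseSlotRange("116")
	fmt.Println(r, r.Contains(115), r.Overlaps(other))
}
```

Treating a single slot as a range of length one keeps the existing single-slot command form working without a separate code path.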
TODO
The TODO items below are points I would like to discuss.
- Support multiple slot ranges:
  - Current situation: Only one migration job can be performed at a time, and each migration job corresponds to a slot range.
  - Possible modification: Allow multiple migration jobs, migrating sequentially or in parallel.
- Perform multiple migrations consecutively without immediately using `setslot` to update the topology; see the example in `TestSlotRangeMigrate`:
```go
t.Run("MIGRATE - Repeat migration cases, but does not immediately update the topology via setslot", func(t *testing.T) {
	// Disjoint
	require.Equal(t, "OK", rdb0.Do(ctx, "clusterx", "migrate", "114-116", id1).Val())
	waitForMigrateSlotRangeState(t, rdb0, "114-116", SlotMigrationStateSuccess)
	require.Equal(t, "OK", rdb0.Do(ctx, "clusterx", "migrate", "117-118", id1).Val())
	waitForMigrateSlotRangeState(t, rdb0, "117-118", SlotMigrationStateSuccess)
	require.Equal(t, "OK", rdb0.Do(ctx, "clusterx", "migrate", "112-113", id1).Val())
	waitForMigrateSlotRangeState(t, rdb0, "112-113", SlotMigrationStateSuccess)

	errMsg := "Can't migrate slot which has been migrated"

	// TODO: After migrating 112-113, the record of 114-118 is overwritten,
	// so migrating those slots again is not detected.
	// require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "114-116", id1).Err(), errMsg)
	// require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "117-118", id1).Err(), errMsg)

	// Intersection
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "112", id1).Err(), errMsg)
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "112-112", id1).Err(), errMsg)
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "113", id1).Err(), errMsg)
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "113-113", id1).Err(), errMsg)

	// Subset
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "112-113", id1).Err(), errMsg)
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "112-120", id1).Err(), errMsg)
	require.ErrorContains(t, rdb0.Do(ctx, "clusterx", "migrate", "110-112", id1).Err(), errMsg)
})
```
This situation also seems to exist in the original single slot migration, and I am not sure if such operations are reasonable.
```
// single slots A, B
migrate A => migrate B => setslot A&B
migrate A => migrate B => migrate A (expected to fail, but allowed to pass) => setslot A&B
// slot range A-C
migrate A-B => migrate C => setslot A-C
migrate A-B => migrate C => migrate B (expected to fail, but allowed to pass) => setslot A-C
```
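One possible way to catch the repeated migration above is to remember every range migrated since the last topology update, instead of only the most recent one. The Go sketch below is purely illustrative and does not describe how kvrocks currently tracks migrated slots: `MigratedSlots`, `TryAdd`, and `Clear` are hypothetical names, and clearing the record after `setslot` is an assumption about when it would be safe to forget a range.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyMigrated mirrors the error text checked in the test above.
var ErrAlreadyMigrated = errors.New("Can't migrate slot which has been migrated")

// SlotRange is a hypothetical inclusive range [Start, End] of hash slots.
type SlotRange struct{ Start, End int }

// Overlaps reports whether the two ranges share at least one slot.
func (r SlotRange) Overlaps(o SlotRange) bool {
	return r.Start <= o.End && o.Start <= r.End
}

// MigratedSlots remembers all ranges migrated since the last topology update.
type MigratedSlots struct {
	ranges []SlotRange
}

// TryAdd records r, or fails if any slot in r was already migrated.
func (m *MigratedSlots) TryAdd(r SlotRange) error {
	for _, done := range m.ranges {
		if done.Overlaps(r) {
			return fmt.Errorf("%w: %d-%d overlaps %d-%d",
				ErrAlreadyMigrated, r.Start, r.End, done.Start, done.End)
		}
	}
	m.ranges = append(m.ranges, r)
	return nil
}

// Clear forgets all recorded ranges (assumed to happen once setslot has
// updated the topology).
func (m *MigratedSlots) Clear() {
	m.ranges = nil
}

func main() {
	var migrated MigratedSlots
	// Replay the sequence from the test: three disjoint ranges, then a repeat.
	for _, r := range []SlotRange{{114, 116}, {117, 118}, {112, 113}, {114, 116}} {
		if err := migrated.TryAdd(r); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("accepted: %d-%d\n", r.Start, r.End)
	}
}
```

A linear scan is enough for a handful of ranges; a sorted list or a per-slot bitmap would be alternatives if many ranges can accumulate before the topology is updated.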
- More precise migration failure slot range:
  The current implementation marks the entire slot range as failed and cleans it up; the user then has to migrate the entire slot range again.
  Do we want to support a more precise failure range? For example, `[start_slot, fail_slot)` and `[fail_slot, end]`. Users could check the status with commands such as `cluster info` and then re-migrate the failed slot range themselves. (Personally, I think this is a bit cumbersome and error-prone.)
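For reference, the split described in the last item could look like the following Go sketch. It assumes the slots of a range are migrated in ascending order, so everything before the failing slot is already done; that ordering is an assumption, and `SplitAtFailure` is a hypothetical helper rather than anything in this PR.

```go
package main

import "fmt"

// SlotRange is a hypothetical inclusive range [Start, End] of hash slots.
type SlotRange struct{ Start, End int }

// SplitAtFailure splits r at failSlot. migrated holds [Start, failSlot) as an
// inclusive range and is empty when the very first slot failed; failed holds
// [failSlot, End], the part that would have to be migrated again.
func SplitAtFailure(r SlotRange, failSlot int) (migrated []SlotRange, failed SlotRange, err error) {
	if failSlot < r.Start || failSlot > r.End {
		return nil, SlotRange{}, fmt.Errorf("slot %d is outside range %d-%d", failSlot, r.Start, r.End)
	}
	if failSlot > r.Start {
		migrated = append(migrated, SlotRange{Start: r.Start, End: failSlot - 1})
	}
	return migrated, SlotRange{Start: failSlot, End: r.End}, nil
}

func main() {
	migrated, failed, err := SplitAtFailure(SlotRange{Start: 112, End: 120}, 115)
	if err != nil {
		panic(err)
	}
	fmt.Println("migrated:", migrated, "to retry:", failed)
}
```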
Miscellaneous
Other suggestions are welcome, such as more testing, code optimization, better user interaction, etc.