Potential Vulnerability in Update Sequence During Slot Migration

Open RiversJin opened this issue 10 months ago • 1 comments

In the tryUpdateMigrationStatus logic, after a successful migration, the controller first updates the topology in the kvrocks cluster via SetSlot, then updates the etcd data using clusterStore.UpdateCluster. However, there appears to be a potential vulnerability here:

If SetSlot succeeds but a network failure prevents UpdateCluster from completing (or worse, if SetSlot partially succeeds with some nodes updated and others failing), this could leave the kvrocks cluster version higher than the cluster version stored in etcd.
Even if another controller takes over from the failed one, it would be unable to execute new operations because subsequent "versions" issued by the controller would no longer exceed the version already in kvrocks.

To address this, I suggest reversing the order: update etcd first, then propagate the new topology to kvrocks.
As a side note, if this approach is adopted, the clusterx set slot command in kvrocks might become redundant, as we could fully update the post-migration topology using setnodes instead.

Feb 19 '25 08:02 RiversJin

@RiversJin Thanks for your investigation, we did have this issue, and changing the order could mitigate it. Another enhancement would be allowed to update the store(ETCD) once the version in Kvrocks is higher than the controller.

Feb 19 '25 09:02 git-hulk