Possible split brain due to a race condition during slot migration
Search before asking
- [x] I had searched in the issues and found no similar issues.
Version
https://github.com/apache/kvrocks/blob/9c9ffb82de7fd70e02ae0244faf05677d27f53b0/src/cluster/slot_migrate.cc#L424-L445
In finishSuccessfulMigration, the code first sets SUCCESS on the destination node, and then the local cluster marks the slot range on itself as migrated.
However, if the destination node has a network issue and setImportStatusOnDstNode returns an error, it is possible that the destination node actually processed setImportStatusOnDstNode successfully and put the slot in imported_slots_, while the local cluster never marks the slot as migrated.
In this case, the old node would still be able to serve reads, and the new node would be able to serve both reads and writes: https://github.com/apache/kvrocks/blob/9c9ffb82de7fd70e02ae0244faf05677d27f53b0/src/cluster/cluster.cc#L926-L930
And if clients send traffic to both nodes at the same time (e.g. because some clients still have the old slot mapping), the clients hitting the old node may see stale values. In the worst case, if the controller/operator resets the node's slots to the old values, all writes to the destination node would be lost?
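To make the interleaving concrete, here is a minimal, self-contained model of the race (all names below are made up for illustration; this is not the kvrocks code). The "lost ack" path shows how the destination ends up serving a slot that the source never marked as migrated:

```cpp
// Minimal model of the race (hypothetical names, not kvrocks code).
#include <cassert>
#include <set>

struct Node {
  std::set<int> owned_slots;     // slots assigned by the last topology update
  std::set<int> migrated_slots;  // source side: slots already handed off
  std::set<int> imported_slots;  // destination side: slots taken over

  bool CanServe(int slot) const {
    // Mirrors the idea behind the linked cluster.cc check: serve a slot you
    // own and have not migrated away, or one you have imported.
    return (owned_slots.count(slot) && !migrated_slots.count(slot)) ||
           imported_slots.count(slot) > 0;
  }
};

// Simulates setImportStatusOnDstNode: the destination applies the status,
// but the reply can be lost, so the caller may see an error anyway.
bool SetImportSuccess(Node &dst, int slot, bool reply_lost) {
  dst.imported_slots.insert(slot);  // applied on the destination either way
  return !reply_lost;
}

int main() {
  const int kSlot = 42;
  Node source, dest;
  source.owned_slots.insert(kSlot);  // old topology: source owns the slot

  // finishSuccessfulMigration, step 1: notify the destination. The ack is lost.
  bool ok = SetImportSuccess(dest, kSlot, /*reply_lost=*/true);
  if (ok) {
    // Step 2 would mark the slot migrated on the source. Never reached here.
    source.migrated_slots.insert(kSlot);
  }

  // Split brain: both nodes now claim the slot.
  assert(source.CanServe(kSlot) && dest.CanServe(kSlot));
  return 0;
}
```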
In https://github.com/apache/kvrocks/blob/9c9ffb82de7fd70e02ae0244faf05677d27f53b0/src/cluster/cluster.cc#L926-L930, should we only return OK if the connection is in ASKING mode?
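Roughly, the guard I have in mind would look like the sketch below. Every type and member name here is hypothetical, not the actual kvrocks API; the real check lives in the linked cluster.cc lines:

```cpp
// Sketch of the proposed guard (all identifiers hypothetical).
#include <cassert>
#include <set>

enum class Verdict { kServe, kMoved };

struct Conn {
  bool asking = false;  // set by the ASKING command, consumed per request
};

struct ClusterView {
  std::set<int> imported_slots;

  Verdict CanExecImported(int slot, Conn &conn) {
    if (imported_slots.count(slot) == 0) return Verdict::kMoved;
    // Current behavior (per the linked lines): serve unconditionally once the
    // slot is imported. Proposed: additionally require the ASKING flag, so
    // clients with a stale slot map keep being bounced to the topology owner.
    if (conn.asking) {
      conn.asking = false;  // one-shot, matching Redis Cluster ASKING semantics
      return Verdict::kServe;
    }
    return Verdict::kMoved;
  }
};

int main() {
  ClusterView view;
  Conn direct, redirected;
  redirected.asking = true;
  view.imported_slots.insert(42);
  assert(view.CanExecImported(42, redirected) == Verdict::kServe);
  assert(view.CanExecImported(42, direct) == Verdict::kMoved);  // stale client
  return 0;
}
```

With such a guard, only traffic that the source node explicitly redirects via ASK reaches the imported slot, so clients holding a stale slot map cannot end up writing to both nodes at once.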
Minimal reproduce step
N/A
What did you expect to see?
N/A
What did you see instead?
N/A
Anything Else?
No response
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
@zhixinwen Thanks for your report.
should we only return OK if the connection is in ASKING mode?
And yes, I think it makes sense to allow the destination node to serve the slot only if the source node redirects the request.
@git-hulk Thanks for your response.
What would happen when the master becomes a slave after the master calls setImportStatusOnDstNode(*dst_fd_, kImportSuccess)? To me, it looks like the new master would have none as migration_state_, and the controller would keep the old topology. However, the destination node would still have imported_slots_ set, and it could potentially receive some writes redirected (via MOVED) from the old source node. In this case, some writes would still be lost?
Let's take a step back: do we really need imported_slots_?
SetClusterNodes will clear imported_slots_ anyway. It is possible that a master-slave failover for some other nodes in the cluster triggers the controller to call SetClusterNodes while migrationLoop has not updated the migration status yet. In this case, imported_slots_ would be cleared even though it should not be.
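Here is a small self-contained model of that clearing race (again with made-up names, not the kvrocks code): a topology push triggered by an unrelated failover wipes the import state before the controller has reassigned the migrated slot, so nobody can serve it.

```cpp
// Model of the clearing race (hypothetical names, not kvrocks code).
#include <cassert>
#include <set>

struct DestNode {
  std::set<int> owned_slots;     // from the last controller topology push
  std::set<int> imported_slots;  // import state for migrated slots

  // Models SetClusterNodes: install the new slot map and clear import state.
  void SetClusterNodes(const std::set<int> &new_owned) {
    owned_slots = new_owned;
    imported_slots.clear();  // cleared even for still-unfinished migrations
  }

  bool CanServe(int slot) const {
    return owned_slots.count(slot) || imported_slots.count(slot);
  }
};

int main() {
  const int kSlot = 42;
  DestNode dest;
  dest.imported_slots.insert(kSlot);  // import finished; serving via import state
  assert(dest.CanServe(kSlot));

  // An unrelated master-slave failover makes the controller push a topology
  // that still maps kSlot to the old source node (migration not finalized).
  dest.SetClusterNodes(/*new_owned=*/{});
  assert(!dest.CanServe(kSlot));  // import state lost; requests get bounced away
  return 0;
}
```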
However, the destination node would still have imported_slots_ set, and it could potentially receive some writes redirected (via MOVED) from the old source node. In this case, some writes would still be lost?
For this case, it will fail to write instead of losing data, because the source node will redirect the write request to the destination node, but the cluster information hasn't been updated yet.
Let's take a step back: do we really need imported_slots_?
I think yes. There is a time window between finishing the migration and the cluster slots being updated, and the destination node cannot serve the imported slot during that window if it doesn't have the imported status.
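To illustrate that window with the same kind of made-up model as above (hypothetical names, not the real code): after the source marks the slot migrated but before the controller pushes the new topology, imported_slots_ is the only thing that lets the destination serve the slot.

```cpp
// Model of the time window (hypothetical names, not the real code).
#include <cassert>
#include <set>

struct Node {
  std::set<int> owned_slots;     // from the last controller topology push
  std::set<int> migrated_slots;  // source side: slots handed off
  std::set<int> imported_slots;  // destination side: slots taken over

  bool CanServe(int slot) const {
    return (owned_slots.count(slot) && !migrated_slots.count(slot)) ||
           imported_slots.count(slot) > 0;
  }
};

int main() {
  const int kSlot = 42;
  Node source, dest;
  source.owned_slots.insert(kSlot);  // topology still says the source owns it

  // The migration finishes cleanly: the destination records the import and
  // the source marks the slot migrated, redirecting clients to the destination.
  dest.imported_slots.insert(kSlot);
  source.migrated_slots.insert(kSlot);

  // Window before the controller updates the cluster slots: the topology
  // still points at the source, yet only the destination can serve the slot,
  // and only because of imported_slots_.
  assert(!source.CanServe(kSlot));
  assert(dest.CanServe(kSlot));
  return 0;
}
```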
Because the source node will redirect the write request to the destination node
Agree
but the cluster information hasn't been updated yet.
I wonder if there is any case where the cluster info is never updated and the old topology is kept. A master that is migrating a slot becoming a slave could be such a case.
If that is possible, then some writes may be redirected to a node that will never own the slot, and we would have write loss? I feel the underlying issue is that until the controller decides a node owns a slot, the node does not really own it. Therefore, any write of that slot to the new node can potentially cause issues.
I wonder if there is any case where the cluster info is never updated and the old topology is kept. A master that is migrating a slot becoming a slave could be such a case.
I think yes; in this case, the new writes will be lost if the source node's replica is promoted.
Thanks for confirming. I will look deeper into it and see what we can do. I feel we would need to block both the source and destination nodes until the controller confirms the topology change. I will need to run some experiments to see how badly it would affect the migration process.