Fix cluster failover missed replicas
Currently, when a CLUSTER FAILOVER is executed in a cluster with multiple replicas (e.g. https://microsoft.github.io/garnet/docs/cluster/replication), only the former primary node is added as a replica to the node we fail over to. This change makes sure all replicas are updated correctly with the new primary.
@microsoft-github-policy-service agree
Can you please provide a repro of the bug that this fix is addressing?
Sure! Here is the output of CLUSTER NODES before the fix on a cluster created following this guide:
127.0.0.1:7000> CLUSTER NODES
35a88121db47cbc3cd4b9547db3c95af1d359bbf 10.211.55.4:7000@17000,ubuntu myself,master - 0 0 1 connected 0-16383
da88dbfc35607e64744a01fffefc0c0a1c9f154a 10.211.55.4:7001@17001,ubuntu slave 35a88121db47cbc3cd4b9547db3c95af1d359bbf 0 0 5 connected
4406a9c737c7a1db26779ebd51c25f35db34cac2 10.211.55.4:7002@17002,ubuntu slave 35a88121db47cbc3cd4b9547db3c95af1d359bbf 0 0 4 connected
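(For reference, the 1-primary/2-replica setup above was assembled roughly as follows. This is a sketch based on the linked guide rather than a verbatim copy, and the command names follow the Redis cluster API that Garnet mirrors, so please check the guide for the exact sequence:)
127.0.0.1:7001> CLUSTER MEET 10.211.55.4 7000
127.0.0.1:7002> CLUSTER MEET 10.211.55.4 7000
127.0.0.1:7000> CLUSTER ADDSLOTSRANGE 0 16383
127.0.0.1:7001> CLUSTER REPLICATE 35a88121db47cbc3cd4b9547db3c95af1d359bbf
127.0.0.1:7002> CLUSTER REPLICATE 35a88121db47cbc3cd4b9547db3c95af1d359bbf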
After issuing CLUSTER FAILOVER on node 2 (10.211.55.4:7001):
127.0.0.1:7001> CLUSTER FAILOVER
OK
127.0.0.1:7001> CLUSTER NODES
da88dbfc35607e64744a01fffefc0c0a1c9f154a 10.211.55.4:7001@17001,ubuntu myself,master - 0 0 6 connected 0-16383
35a88121db47cbc3cd4b9547db3c95af1d359bbf 10.211.55.4:7000@17000,ubuntu master - 0 0 1 connected
4406a9c737c7a1db26779ebd51c25f35db34cac2 10.211.55.4:7002@17002,ubuntu slave da88dbfc35607e64744a01fffefc0c0a1c9f154a 0 0 7 connected
We now have two masters, 7001 and the former primary 7000 (7000 is unusable, though, as it redirects all commands to 7001). 7002's primary changed to 7001 as expected.
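To make "unusable" concrete: 7000 no longer owns any slots, so every key access against it gets redirected. An illustrative exchange (the slot number depends on the key, and Garnet's error text may differ slightly):
127.0.0.1:7000> GET foo
(error) MOVED 12182 10.211.55.4:7001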
Now here is the same scenario after the fix. On 7000:
127.0.0.1:7000> CLUSTER NODES
07ee347c05ef247632c904b037d75468341eadae 10.211.55.4:7000@17000,ubuntu myself,master - 0 0 1 connected 0-16383
cd8c327df91ffa650f01073fbb0fe9a4142e80cb 10.211.55.4:7001@17001,ubuntu slave 07ee347c05ef247632c904b037d75468341eadae 0 0 4 connected
bf3ee567c4a897cbb396b7617393558e2a5d3fbe 10.211.55.4:7002@17002,ubuntu slave 07ee347c05ef247632c904b037d75468341eadae 0 0 5 connected
On 7001:
127.0.0.1:7001> CLUSTER FAILOVER
OK
127.0.0.1:7001> CLUSTER NODES
cd8c327df91ffa650f01073fbb0fe9a4142e80cb 10.211.55.4:7001@17001,ubuntu myself,master - 0 0 6 connected 0-16383
07ee347c05ef247632c904b037d75468341eadae 10.211.55.4:7000@17000,ubuntu slave cd8c327df91ffa650f01073fbb0fe9a4142e80cb 0 0 7 connected
bf3ee567c4a897cbb396b7617393558e2a5d3fbe 10.211.55.4:7002@17002,ubuntu slave cd8c327df91ffa650f01073fbb0fe9a4142e80cb 0 0 8 connected
All replicas, including the former primary, are now following 7001 as expected.
All of this already works for clusters with 2 nodes, or for a sharded cluster with a single replica per shard. Having more than 2 nodes breaks the cluster state after failover.
Thank you for identifying the issue. As a good practice for the future, please open an issue before creating a PR so we have a history of all the discussions.

To provide some context: this is a gap in the supported feature. Right now, we assume that the primary is down and that failover is called on one of the replicas with the TAKEOVER option, without any leader election. In that case, we make the replica that is taking over a primary and assign the old primary's slots to it. We do not update anything else in the configuration, because we cannot guarantee that the remote nodes (including the old primary) are reachable. If they are not reachable, updating the config would present an inconsistent view to the other, reachable nodes in the cluster. We will update the configuration for the old replicas (and, currently missing, the old primary) by issuing a gossip round and a REPLICAOF command, so the remote nodes can acknowledge that they have become replicas. Given this scenario, the current implementation is not appropriate.

There is another issue related to the attachReplicas method, which prevents communication with all reachable remote nodes when any one of them is unreachable. The latter is an existing bug, and it manifests in the unit tests after your change because the primary is down. I have opened an issue regarding this and will take a stab at it soon.

I am leaving the PR open and you are welcome to continue contributing towards a solution :)
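To illustrate the attachReplicas problem described above, here is a minimal, self-contained C# sketch. It is not the Garnet code; the helper names (TryNotifyReplicaOf, AttachReplicasFragile, AttachReplicasTolerant) are made up purely to show the abort-on-first-failure behavior versus a variant that skips unreachable nodes and lets gossip reconcile them later:

using System;
using System.Collections.Generic;

static class FailoverSketch
{
    // Pretend transport call: returns false when the remote node is unreachable.
    // Here the old primary on port 7000 is simulated as being down.
    static bool TryNotifyReplicaOf(string endpoint, string newPrimaryId)
        => !endpoint.EndsWith(":7000");

    // Abort-on-first-failure behavior, analogous to the issue described above:
    // once the unreachable old primary is hit, the remaining reachable
    // replicas are never told about the new primary.
    static bool AttachReplicasFragile(IEnumerable<string> endpoints, string newPrimaryId)
    {
        foreach (var ep in endpoints)
            if (!TryNotifyReplicaOf(ep, newPrimaryId))
                return false;
        return true;
    }

    // Tolerant variant: record unreachable nodes and keep going, so every
    // reachable replica is updated and gossip can reconcile the rest later.
    static List<string> AttachReplicasTolerant(IEnumerable<string> endpoints, string newPrimaryId)
    {
        var unreachable = new List<string>();
        foreach (var ep in endpoints)
            if (!TryNotifyReplicaOf(ep, newPrimaryId))
                unreachable.Add(ep);
        return unreachable;
    }

    static void Main()
    {
        var endpoints = new[] { "10.211.55.4:7000", "10.211.55.4:7002" };
        Console.WriteLine(AttachReplicasFragile(endpoints, "new-primary-id"));                    // False: 7002 is never contacted
        Console.WriteLine(string.Join(", ", AttachReplicasTolerant(endpoints, "new-primary-id"))); // only 10.211.55.4:7000 reported as unreachable
    }
}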
Closing this because the provided solution introduces regressions.