Fix cluster failover missed replicas
Currently, when a CLUSTER FAILOVER is executed in a cluster with multiple replicas (e.g. https://microsoft.github.io/garnet/docs/cluster/replication), only the former primary node is added as a replica to the node we fail over to. This change makes sure all replicas are updated correctly with the new primary.
@microsoft-github-policy-service agree
Can you please provide a repro of the bug that this fix is addressing?
Sure! Here is the output of CLUSTER NODES before the fix on a cluster created following this guide:
127.0.0.1:7000> CLUSTER NODES
35a88121db47cbc3cd4b9547db3c95af1d359bbf 10.211.55.4:7000@17000,ubuntu myself,master - 0 0 1 connected 0-16383
da88dbfc35607e64744a01fffefc0c0a1c9f154a 10.211.55.4:7001@17001,ubuntu slave 35a88121db47cbc3cd4b9547db3c95af1d359bbf 0 0 5 connected
4406a9c737c7a1db26779ebd51c25f35db34cac2 10.211.55.4:7002@17002,ubuntu slave 35a88121db47cbc3cd4b9547db3c95af1d359bbf 0 0 4 connected
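(For reference, the 1-primary/2-replica setup above was assembled roughly as follows. This is a sketch based on the linked guide rather than a verbatim copy, and the command names follow the Redis cluster API that Garnet mirrors, so please check the guide for the exact sequence:)
127.0.0.1:7001> CLUSTER MEET 10.211.55.4 7000
127.0.0.1:7002> CLUSTER MEET 10.211.55.4 7000
127.0.0.1:7000> CLUSTER ADDSLOTSRANGE 0 16383
127.0.0.1:7001> CLUSTER REPLICATE 35a88121db47cbc3cd4b9547db3c95af1d359bbf
127.0.0.1:7002> CLUSTER REPLICATE 35a88121db47cbc3cd4b9547db3c95af1d359bbf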
After issuing CLUSTER FAILOVER on node 2 (10.211.55.4:7001):
127.0.0.1:7001> CLUSTER FAILOVER
OK
127.0.0.1:7001> CLUSTER NODES
da88dbfc35607e64744a01fffefc0c0a1c9f154a 10.211.55.4:7001@17001,ubuntu myself,master - 0 0 6 connected 0-16383
35a88121db47cbc3cd4b9547db3c95af1d359bbf 10.211.55.4:7000@17000,ubuntu master - 0 0 1 connected
4406a9c737c7a1db26779ebd51c25f35db34cac2 10.211.55.4:7002@17002,ubuntu slave da88dbfc35607e64744a01fffefc0c0a1c9f154a 0 0 7 connected
We now have two masters, 7001 and the former primary 7000 (7000 is unusable, though, as it redirects all commands to 7001). 7002's primary changed to 7001 as expected.
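To make "unusable" concrete: 7000 no longer owns any slots, so every key access against it gets redirected. An illustrative exchange (the slot number depends on the key, and Garnet's error text may differ slightly):
127.0.0.1:7000> GET foo
(error) MOVED 12182 10.211.55.4:7001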
Now here is the same scenario after the fix. On 7000:
127.0.0.1:7000> CLUSTER NODES
07ee347c05ef247632c904b037d75468341eadae 10.211.55.4:7000@17000,ubuntu myself,master - 0 0 1 connected 0-16383
cd8c327df91ffa650f01073fbb0fe9a4142e80cb 10.211.55.4:7001@17001,ubuntu slave 07ee347c05ef247632c904b037d75468341eadae 0 0 4 connected
bf3ee567c4a897cbb396b7617393558e2a5d3fbe 10.211.55.4:7002@17002,ubuntu slave 07ee347c05ef247632c904b037d75468341eadae 0 0 5 connected
On 7001:
127.0.0.1:7001> CLUSTER FAILOVER
OK
127.0.0.1:7001> CLUSTER NODES
cd8c327df91ffa650f01073fbb0fe9a4142e80cb 10.211.55.4:7001@17001,ubuntu myself,master - 0 0 6 connected 0-16383
07ee347c05ef247632c904b037d75468341eadae 10.211.55.4:7000@17000,ubuntu slave cd8c327df91ffa650f01073fbb0fe9a4142e80cb 0 0 7 connected
bf3ee567c4a897cbb396b7617393558e2a5d3fbe 10.211.55.4:7002@17002,ubuntu slave cd8c327df91ffa650f01073fbb0fe9a4142e80cb 0 0 8 connected
All replicas, including the former primary, are now following 7001 as expected.
All of this already works for clusters with 2 nodes, or for a sharded cluster with a single replica per shard. Having more than 2 nodes breaks the cluster state after failover.
Thank you for identifying the issue. As a good practice for the future, please open an issue before creating a PR so we have a history of all the discussions.

To provide some context: this is a gap in the supported feature. Right now, we assume that the primary is down and that failover is called on one of the replicas with the TAKEOVER option, without any leader election. In that case, we make the replica that is taking over a primary and assign the old primary's slots to it. We do not update anything else in the configuration, because we cannot guarantee that the remote nodes (including the old primary) are reachable. If they are not reachable, updating the config would present an inconsistent view to the other, reachable nodes in the cluster. We will update the configuration for the old replicas (and, currently missing, the old primary) by issuing a gossip round and a REPLICAOF command, so the remote nodes can acknowledge that they have become replicas. Given this scenario, the current implementation is not appropriate.

There is another issue related to the attachReplicas method, which prevents communication with all reachable remote nodes when any one of them is unreachable. The latter is an existing bug, and it manifests in the unit tests after your change because the primary is down. I have opened an issue regarding this and will take a stab at it soon.

I am leaving the PR open and you are welcome to continue contributing towards a solution :)
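To illustrate the attachReplicas problem described above, here is a minimal, self-contained C# sketch. It is not the Garnet code; the helper names (TryNotifyReplicaOf, AttachReplicasFragile, AttachReplicasTolerant) are made up purely to show the abort-on-first-failure behavior versus a variant that skips unreachable nodes and lets gossip reconcile them later:

using System;
using System.Collections.Generic;

static class FailoverSketch
{
    // Pretend transport call: returns false when the remote node is unreachable.
    // Here the old primary on port 7000 is simulated as being down.
    static bool TryNotifyReplicaOf(string endpoint, string newPrimaryId)
        => !endpoint.EndsWith(":7000");

    // Abort-on-first-failure behavior, analogous to the issue described above:
    // once the unreachable old primary is hit, the remaining reachable
    // replicas are never told about the new primary.
    static bool AttachReplicasFragile(IEnumerable<string> endpoints, string newPrimaryId)
    {
        foreach (var ep in endpoints)
            if (!TryNotifyReplicaOf(ep, newPrimaryId))
                return false;
        return true;
    }

    // Tolerant variant: record unreachable nodes and keep going, so every
    // reachable replica is updated and gossip can reconcile the rest later.
    static List<string> AttachReplicasTolerant(IEnumerable<string> endpoints, string newPrimaryId)
    {
        var unreachable = new List<string>();
        foreach (var ep in endpoints)
            if (!TryNotifyReplicaOf(ep, newPrimaryId))
                unreachable.Add(ep);
        return unreachable;
    }

    static void Main()
    {
        var endpoints = new[] { "10.211.55.4:7000", "10.211.55.4:7002" };
        Console.WriteLine(AttachReplicasFragile(endpoints, "new-primary-id"));                    // False: 7002 is never contacted
        Console.WriteLine(string.Join(", ", AttachReplicasTolerant(endpoints, "new-primary-id"))); // only 10.211.55.4:7000 reported as unreachable
    }
}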
Closing this because the provided solution introduces regressions.