Fix data loss when the old primary takes over the slots after coming back online
There is a race in clusterHandleConfigEpochCollision that may let the old primary take over the slots again after coming back online, causing data loss. It happens when the old primary and the new primary end up with the same config epoch, and the old primary has the smaller node ID and therefore wins the collision.
In this case, the old primary and the new primary are in the same shard, and we cannot tell which one strictly has the latest data. To prevent data loss, clusterHandleConfigEpochCollision now lets the node with the larger replication offset win the conflict.
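A minimal sketch of the new tiebreak, showing the shape of the change in src/cluster_legacy.c; the helpers areInSameShard and getNodeReplicationOffset are assumed here for the shard check and for reading a node's replication offset:

```c
static void clusterHandleConfigEpochCollision(clusterNode *sender) {
    /* Prerequisite: the config epochs collide and both nodes are primaries. */
    if (sender->configEpoch != myself->configEpoch ||
        !clusterNodeIsPrimary(sender) || !clusterNodeIsPrimary(myself))
        return;

    if (areInSameShard(sender, myself)) {
        /* Same shard: we cannot tell whose config is strictly newer, so the
         * node with the larger replication offset wins, since it is more
         * likely to hold the latest data. Fall back to node ID on a tie. */
        long long sender_offset = getNodeReplicationOffset(sender);
        long long my_offset = getNodeReplicationOffset(myself);
        if (sender_offset < my_offset) return;
        if (sender_offset == my_offset &&
            memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0)
            return;
    } else {
        /* Different shards: keep the old behavior, the node with the
         * smaller node ID wins and keeps its epoch. */
        if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;
    }

    /* We lost the collision: take the next available epoch and persist it. */
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;
    clusterSaveConfigOrDie(1);
}
```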
In addition to this change, when a node increments its config epoch through a collision, CLUSTER FAILOVER TAKEOVER, or CLUSTER BUMPEPOCH, we now send PONGs to all nodes so the cluster reaches consensus on the new config epoch more quickly.
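The broadcast itself can reuse the existing clusterBroadcastPong helper; a sketch of the idea (not the exact diff), placed right after each epoch bump:

```c
/* After bumping the config epoch (collision loss, CLUSTER FAILOVER TAKEOVER,
 * or CLUSTER BUMPEPOCH), push the new epoch to every node immediately
 * instead of waiting for the regular gossip rounds. */
server.cluster->currentEpoch++;
myself->configEpoch = server.cluster->currentEpoch;
clusterSaveConfigOrDie(1);
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
```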
This also closes #969.
Here are the logs:
The old primary:
# The old primary, restarting
2024-08-30T05:51:13.1503179Z ### Starting test Restarting the previously killed master nodes in tests/unit/cluster/manual-takeover.tcl
# It gets configEpoch collisions with the other primaries (including the new primary) and wins them.
2024-08-30T05:51:13.1932876Z 14312:M 30 Aug 2024 05:50:20.994 * Connection with replica 127.0.0.1:28624 lost.
2024-08-30T05:51:13.1934608Z 14312:M 30 Aug 2024 05:50:20.994 * FAIL message received from 4260b224913d8b1d0dcf27a909983ac412d171f0 () about 77296d415c4b47c081ef5c99a1720061717e06de ()
2024-08-30T05:51:13.1936882Z 14312:M 30 Aug 2024 05:50:20.994 * FAIL message received from 4260b224913d8b1d0dcf27a909983ac412d171f0 () about c0bc5fd54657c5b445405d4ac18b84345959fbcc ()
2024-08-30T05:51:13.1939066Z 14312:M 30 Aug 2024 05:50:21.000 * configEpoch collision with node 6bb7d855c2b596cab135cbabcf1903ea20d2699a (). configEpoch set to 8
2024-08-30T05:51:13.1946224Z 14312:M 30 Aug 2024 05:50:21.003 * configEpoch collision with node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 (). configEpoch set to 9
# It sends an UPDATE message, which makes the new primary become a replica.
2024-08-30T05:51:13.1982806Z 14312:M 30 Aug 2024 05:50:21.004 - Node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 has old slots configuration, sending an UPDATE message about 30e8a59b00dfefddb91659aa15cc4d24da0c00f7
2024-08-30T05:51:13.1987920Z 14312:M 30 Aug 2024 05:50:21.004 - Client closed connection id=10 addr=127.0.0.1:43115 laddr=127.0.0.1:28629 fd=37 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=37760 events=r cmd=NULL user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=0 tot-net-out=0 tot-cmds=0
2024-08-30T05:51:13.2031232Z 14312:M 30 Aug 2024 05:50:21.007 - Node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 has old slots configuration, sending an UPDATE message about 30e8a59b00dfefddb91659aa15cc4d24da0c00f7
2024-08-30T05:51:13.2033814Z 14312:M 30 Aug 2024 05:50:21.007 * Node 15a7b11e4dbfbefe354a41ed39e535694d6c8058 () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.
# The new primary becomes a replica.
2024-08-30T05:51:13.2036446Z 14312:M 30 Aug 2024 05:50:21.014 * Replica 127.0.0.1:28624 asks for synchronization
2024-08-30T05:51:13.2039410Z 14312:M 30 Aug 2024 05:50:21.014 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'e3dde8f6addf6deed58fd8f1a91024f399850238', my replication IDs are '480e5eaf42d14b1037c43eb39525c965cfd4d752' and '0000000000000000000000000000000000000000')
2024-08-30T05:51:13.2042187Z 14312:M 30 Aug 2024 05:50:21.014 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
2024-08-30T05:51:13.2043751Z 14312:M 30 Aug 2024 05:50:21.015 * Background RDB transfer started by pid 16253 to pipe through parent process
2024-08-30T05:51:13.2045418Z ### Starting test Instance #0, #1, #2 gets converted into a slaves in tests/unit/cluster/manual-takeover.tcl
The new primary (did a manual failover takeover, and ended up as a replica):
# The new primary (the replica that used CLUSTER FAILOVER TAKEOVER)
2024-08-30T05:51:14.2595326Z 14142:S 30 Aug 2024 05:50:17.310 * Taking over the primary (user request).
2024-08-30T05:51:14.2596355Z 14142:S 30 Aug 2024 05:50:17.310 * New configEpoch set to 8
2024-08-30T05:51:14.2597130Z 14142:M 30 Aug 2024 05:50:17.310 * Connection with primary lost.
2024-08-30T05:51:14.2612166Z 14142:M 30 Aug 2024 05:50:17.310 * Caching the disconnected primary state.
2024-08-30T05:51:14.2613391Z 14142:M 30 Aug 2024 05:50:17.310 * Discarding previously cached primary state.
2024-08-30T05:51:14.2615593Z 14142:M 30 Aug 2024 05:50:17.310 * Setting secondary replication ID to 480e5eaf42d14b1037c43eb39525c965cfd4d752, valid up to offset: 1354. New replication ID is e3dde8f6addf6deed58fd8f1a91024f399850238
# (some noisy log lines omitted)
# The old primary comes back online.
2024-08-30T05:51:14.2800774Z 14142:M 30 Aug 2024 05:50:20.995 - Node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 has old slots configuration, sending an UPDATE message about 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4
2024-08-30T05:51:14.2803401Z 14142:M 30 Aug 2024 05:50:20.999 * Node 15a7b11e4dbfbefe354a41ed39e535694d6c8058 () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.
2024-08-30T05:51:14.2805843Z 14142:M 30 Aug 2024 05:50:21.003 * Node 6bb7d855c2b596cab135cbabcf1903ea20d2699a () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.
2024-08-30T05:51:14.2808216Z 14142:M 30 Aug 2024 05:50:21.009 * Clear FAIL state for node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 (): primary without slots is reachable again.
# The old primary wins the collision; the new primary loses its slots and becomes a replica.
2024-08-30T05:51:14.2810907Z 14142:M 30 Aug 2024 05:50:21.010 * Configuration change detected. Reconfiguring myself as a replica of node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 () in shard d5800922644aa5418d1d64a243612a1265c0d042
2024-08-30T05:51:14.2877461Z 14142:S 30 Aug 2024 05:50:21.010 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
2024-08-30T05:51:14.2892822Z 14142:S 30 Aug 2024 05:50:21.010 * Connecting to PRIMARY 127.0.0.1:28629