Fix data loss when a replica does a failover with an old historical repl offset
Currently, a replica can initiate a failover without restriction when it detects that its primary node is offline. This is generally not a problem. However, consider the following scenarios:
- In slot migration, a primary loses its last slot and then becomes a replica. While it is still performing the full synchronization with the new primary, the new primary goes down.
- With the CLUSTER REPLICATE command, a node becomes a replica of another primary. While it is still performing the full synchronization with the new primary, the new primary goes down.
In the above scenarios, case 1 may cause the empty primary to be elected as the new primary, resulting in the loss of the primary's data. Case 2 may cause the non-empty replica (still holding its old data set) to be elected as the new primary, resulting in data loss and confusion.
The reason is the cached primary logic, which is used for psync. In the above scenarios, when clusterSetPrimary is called, the node caches server.primary in server.cached_primary for a later psync attempt. In replicationGetReplicaOffset, we then read server.cached_primary->reploff as the offset, gossip it, and use it for ranking. As a result, the replica initiates a failover with the old historical offset, gets a good rank, starts the election first, and is elected as the new primary.
The main problem here is that while the replica has not yet completed the full sync, replicationGetReplicaOffset may return the historical offset.
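To make the mechanism concrete, here is a simplified sketch of the offset lookup described above (not the exact source; the real replicationGetReplicaOffset() in replication.c has more details):

```c
/* Simplified sketch: where the advertised replication offset can come from
 * before a full sync with the new primary has completed. */
long long replicationGetReplicaOffset(void) {
    long long offset = 0;

    if (server.primary_host != NULL) {
        if (server.primary) {
            /* Link to the primary is established: use the live offset. */
            offset = server.primary->reploff;
        } else if (server.cached_primary) {
            /* Still in handshake / waiting for the full sync: fall back to
             * the primary cached for PSYNC. After clusterSetPrimary() this
             * is the OLD primary, so its reploff is a historical offset
             * unrelated to the new primary's replication history. */
            offset = server.cached_primary->reploff;
        }
    }
    if (offset < 0) offset = 0;
    return offset;
}
```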
The fix is to clear cached_primary in the places where a full sync is clearly required, and let the replica use offset == 0 to participate in the election. This way, the unhealthy replica gets a worse rank and is less likely to be elected.
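A hypothetical sketch of the idea, assuming helpers along the lines of replicationDiscardCachedPrimary() and replicationSetPrimary() and a port field on clusterNode (the exact call sites, signatures, and names in the patch may differ):

```c
/* Hypothetical sketch of the fix. Called from the paths that make a node a
 * replica of a new primary (slot migration emptying a primary, or
 * CLUSTER REPLICATE to a different primary). */
void clusterSetPrimaryWithFullSync(clusterNode *new_primary, int full_sync_required) {
    if (full_sync_required) {
        /* A partial resync against the old primary is impossible, so do not
         * keep it cached: replicationGetReplicaOffset() will then report 0
         * until the new sync makes progress, giving this replica the worst
         * rank in a failover election. */
        replicationDiscardCachedPrimary();
    }
    /* Continue with the normal replication switch to the new primary. */
    replicationSetPrimary(new_primary->ip, new_primary->port);
}
```

With the cached primary gone, the offset this replica gossips is 0, so healthy replicas of the failed primary win the rank comparison and are elected first.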
Of course, it is still possible for it to be elected with offset == 0. In the future, we may need to prohibit replicas with offset == 0 from initiating elections at all.
Another point worth mentioning, in the above cases:
- In the ROLE command, the replica status will be handshake, and the offset will be -1.
- Before this PR, in the CLUSTER SHARDS command, the replica status would be online, and the offset would be the old cached value (which is wrong).
- After this PR, in CLUSTER SHARDS, the replica status will be loading, and the offset will be 0.
force-push for the DCO.
Codecov Report
:x: Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 70.58%. Comparing base (7424620) to head (b47148d).
:warning: Report is 848 commits behind head on unstable.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster_legacy.c | 78.57% | 3 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## unstable #885 +/- ##
============================================
+ Coverage 70.39% 70.58% +0.18%
============================================
Files 112 112
Lines 61465 61509 +44
============================================
+ Hits 43271 43417 +146
+ Misses 18194 18092 -102
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/replication.c | 87.02% <100.00%> (-0.11%) | :arrow_down: |
| src/server.h | 100.00% <ø> (ø) | |
| src/cluster_legacy.c | 85.58% <78.57%> (+0.06%) | :arrow_up: |
@PingXie thanks for the review! I think I took care of all the comments, please take another look.