Cluster: Fix sub-replica and prevent empty primary when replica is blocked at replication and has inbound/outbound link race
Summary
This PR handles a network race condition that would cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.
Issues
When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue PSYNC triggers an RDB child fork (RDB_CHILD_TYPE_SOCKET) for replication.
Any replicas connecting shortly after the fork miss the join window: their PING commands are accepted but not replied to until the primary's fork completes. As a result, those replicas remain blocked in REPL_STATE_RECEIVE_PING_REPLY on their main thread, unable to process any cluster or client traffic, effectively dead to the outside world.
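To make the blocking concrete, here is a minimal, self-contained sketch of that handshake step in C. It is an illustration only: `sync_read_line`, `wait_for_ping_reply`, and the hardcoded 5s constant are stand-ins, not the actual replication.c identifiers.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Illustrative stand-in for the syncio-style helper the handshake uses:
 * wait up to timeout_ms for data, then do a single read. Returns -1 on
 * timeout or error. */
static long sync_read_line(int fd, char *buf, size_t len, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLIN };
    if (poll(&p, 1, timeout_ms) <= 0) return -1;
    return read(fd, buf, len);
}

/* The replica parks its main thread here after sending PING. While this
 * call waits (up to repl_syncio_timeout, 5s by default), no cluster-bus
 * packets or client commands are processed, which is why the node looks
 * "dead" to the rest of the cluster. */
static int wait_for_ping_reply(int fd) {
    char buf[256];
    if (sync_read_line(fd, buf, sizeof(buf), 5 * 1000) == -1)
        return -1; /* Timed out: abort the handshake and reconnect later. */
    return 0;      /* Got the primary's reply; the handshake continues. */
}
```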
If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:
- [1] Time out after a while (`server.repl_syncio_timeout` is 5s by default) and resume its main thread, reconnecting with other nodes.
- [2] Receive a fresh PONG reply from the new leader.
- [3] Then receive delayed, stale PING messages from the new leader that were buffered before the failover (they didn't reach the replica earlier, because the replica was "dead").
- [4] Misinterpret that stale message as current truth, think it's now a sub-replica, decide to follow the old primary again, and, after detecting it as FAILED, start and win a new election.
This results in two primaries in the same shard, one of them empty (0 slots) but still considered authoritative by every node in the cluster.
For a concrete example, see the test case test_blocked_replica_stale_state_race and its comment in tests/unit/cluster/replica-migration.tcl.
Overall there are two issues:
- Replicas waiting for a replication reply block on the main thread and stop all activity. This is true even with rdb-key-save-delay = 0, because the underlying cause is replication enrollment timing, not artificial delay.
- The blocked replica can go through the events above and become an empty primary.
This PR handles the second issue. Notice that this issue is flaky: if event [3] happens before [2] (the inbound and outbound links are independent, so we can't guarantee their ordering), the replica won't become a sub-replica and we won't hit the empty-primary issue.
I added guardrail logic in case such a race condition does happen, and I also wrote a new test case to exercise my code, but since the issue can't be reliably reproduced (it runs fine on my machine, but not on CI), the test case is disabled for now.
Fix
There could be different approaches to solving the problem. The way I do it is (see the sketch after this list):
- A replica shouldn't trigger a failover when its replication offset is 0.
- When a replica's primary is marked FAILED and it receives a PING/PONG packet from a sender claiming to be the primary of the same shard, the replica should accept and follow that sender.
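As a rough illustration of the two rules, here is a sketch in C. It is only a model: the toy Node struct and every identifier below are assumptions for illustration, not the actual cluster_legacy.c code.

```c
#include <string.h>

/* Toy model of a cluster node; the real code works on clusterNode. */
typedef struct Node {
    int claims_primary;   /* Sender advertises itself as a primary. */
    int failed;           /* Node is marked FAIL by the cluster. */
    char shard_id[41];    /* Shard the node belongs to. */
} Node;

/* Rule 1: a replica whose replication offset is still 0 has replicated
 * nothing, so promoting it would create an empty primary. Refuse to
 * start a failover in that case. */
int failover_allowed(long long my_repl_offset) {
    return my_repl_offset != 0;
}

/* Rule 2: our primary is FAILED and a PING/PONG arrives from a sender
 * claiming to be the primary of the same shard. Instead of trusting the
 * stale view that would turn us into a sub-replica of the dead primary,
 * follow the announced primary directly. */
const Node *pick_primary(const Node *my_primary, const Node *sender) {
    if (my_primary && my_primary->failed &&
        sender && sender->claims_primary &&
        strcmp(sender->shard_id, my_primary->shard_id) == 0)
        return sender;     /* Accept and follow the new primary. */
    return my_primary;     /* Otherwise keep the current view. */
}
```

Together, these two rules mean a data-less replica can never win an election, and a replica whose primary has failed converges to the shard's actual primary instead of chasing a stale one.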
The file tests/unit/cluster/replica-migration.tcl is mostly meant to test the logic in the function clusterUpdateSlotsConfigWith, so I added my new test case there. Since the test file had some duplicated code, I extracted it into helper functions to simplify the tests.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 72.40%. Comparing base (d16788e) to head (e3a98ce).
:warning: Report is 15 commits behind head on unstable.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable    #2811      +/-   ##
============================================
- Coverage     72.45%   72.40%   -0.05%
============================================
  Files           128      128
  Lines         70485    70497      +12
============================================
- Hits          51068    51042      -26
- Misses        19417    19455      +38
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | 87.58% <100.00%> (+0.04%) | :arrow_up: |
| src/replication.c | 85.82% <ø> (-0.08%) | :arrow_down: |
| src/socket.c | 94.21% <100.00%> (ø) | |
@PingXie @enjoy-binbin @hpatro Friendly ping 👀 Could you take a look at this when you get some time?