raft-rs icon indicating copy to clipboard operation
raft-rs copied to clipboard

Replication progress corruption when node rejoins cluster within same term

Open drmingdrmer opened this issue 1 month ago • 2 comments

Summary

We've identified a replication session isolation bug in raft-rs that can cause infinite retry loops when a node is removed and then re-added to the cluster within the same term. While this doesn't compromise data safety, it leads to significant operational issues including resource exhaustion and nodes unable to catch up with the cluster.

Problem Description

When a node rejoins the cluster, delayed AppendEntries responses from the old membership configuration can corrupt the leader's progress tracking for that node. The leader gets stuck in an infinite retry loop, continuously sending AppendEntries requests that the node rejects.

The issue stems from lack of replication session isolation: The Progress struct has no mechanism to distinguish between different replication sessions. The maybe_update method (src/tracker/progress.rs:136-148) only performs monotonicity checks (matched < n), which cannot prevent corruption from stale responses. When a node rejoins, matched is reset to 0, so any delayed response with index > 0 passes the check and corrupts the new session's progress tracking.

Reproduction Scenario

Timeline | Event                                    | Progress State
---------|------------------------------------------|------------------
T1       | Node C in cluster, term=5                | C: matched=50
         | Leader sends AppendEntries(index=50)     | (network delay)
         |                                          |
T2       | Node C removed from cluster              | C: [deleted]
         | Progress[C] removed from tracker         |
         |                                          |
T3       | Node C rejoins cluster (still term=5)    | C: matched=0 (new)
         | New Progress[C] created                  |
         |                                          |
T4       | Delayed response arrives                 | C: matched=50 ❌
         | {from: C, index: 50, success: true}      | Corrupted!
         | maybe_update(50) returns true (50 > 0)   |
         |                                          |
T5       | Leader sends AppendEntries(prev=50)      |
         | Node C rejects (actual log_index < 50)   |
         | Infinite loop begins                     |

Impact

Data Safety: ✓ Not compromised

  • Commit protocol ensures majority agreement
  • No committed data is lost or corrupted

Operational Impact: ✗ Significant

  • Infinite retry loops consuming CPU and network bandwidth
  • Nodes unable to replicate and catch up with cluster
  • Manual intervention required to recover
  • Reduced cluster redundancy during the issue

Potential Solution

Introduce session versioning in the Progress struct:

pub struct Progress {
    pub matched: u64,
    pub next_idx: u64,
    pub state: ProgressState,
    pub session_version: u64,  // Add session identifier
    // ...
}

impl Progress {
    pub fn maybe_update(&mut self, n: u64, response_version: u64) -> bool {
        // Validate session first
        if response_version != self.session_version {
            return false;  // Reject stale response from old session
        }
        
        let need_update = self.matched < n;
        if need_update {
            self.matched = n;
            self.next_idx = n + 1;
            self.resume();
        }
        need_update
    }
}

The session version would be incremented whenever a node is removed and re-added, ensuring responses from old sessions are rejected.

Additional Resources

Complete analysis and survey of other Raft implementations: https://github.com/drmingdrmer/raft-rejoin-bug/blob/main/analysis/raft-rs.md

drmingdrmer avatar Nov 20 '25 13:11 drmingdrmer

Welcome @drmingdrmer! It looks like this is your first issue to tikv/raft-rs 🎉

ti-chi-bot[bot] avatar Nov 20 '25 13:11 ti-chi-bot[bot]

I'm afraid this is actually a wrong usage. A node rejoins the group using the same ID is the same as a node corrupts its all data and states, which breaks the assumption of this library (and Raft algorithm): data once persisted should not be lost. The correct way is once a node is removed, it should be destroyed. If a node joins a group, it should have a unique ID.

BusyJay avatar Nov 21 '25 04:11 BusyJay