
ZOOKEEPER-4925: Fix data loss due to propagation of discontinuous committedLog

Open kezhuw opened this issue 6 months ago • 1 comments

There are two variants of ZooKeeperServer::processTxn, and they have diverged significantly since ZOOKEEPER-3484. In addition to what processTxn(TxnHeader hdr, Record txn) does, processTxn(Request request) pops the outstanding change from outstandingChanges and adds the txn to committedLog for followers to sync. Since ZOOKEEPER-4394, the Learner uses processTxn(TxnHeader hdr, Record txn) to commit txns to memory, which means it leaves committedLog untouched during the SYNCHRONIZATION phase.

As a result, a stale follower will have a hole in its committedLog after joining the cluster. After becoming leader, that follower will propagate the in-memory hole to other stale nodes. This causes data loss.

The test case fails on master and 3.9.3 but passes on 3.9.2, so 3.9.3 is the only affected release.

This commit drops processTxn(TxnHeader hdr, Record txn), since processTxn(Request request) can be used in the SYNCHRONIZATION phase as well.

This commit also rejects discontinuous proposals in syncWithLeader and committedLog to avoid possible data loss.
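
To illustrate what rejecting a discontinuous proposal can look like, here is a minimal Java sketch. It is not the code from this patch: the class CommittedLogSketch, its methods, and the exact continuity rule are hypothetical and chosen only for illustration. It relies on the zxid layout (epoch in the high 32 bits, counter in the low 32 bits) and assumes that a proposal is continuous when it either increments the counter by one within the same epoch or starts a newer epoch at counter 1.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the ZooKeeper implementation.
final class CommittedLogSketch {
    // Zxids of committed proposals, oldest first.
    private final Deque<Long> zxids = new ArrayDeque<>();

    // A zxid carries the epoch in its high 32 bits and a counter in its low 32 bits.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Assumed continuity rule (an assumption of this sketch, not taken from the patch):
    // same epoch with counter + 1, or a newer epoch starting at counter 1.
    static boolean isContinuous(long last, long next) {
        if (epochOf(next) == epochOf(last)) {
            return counterOf(next) == counterOf(last) + 1;
        }
        return epochOf(next) > epochOf(last) && counterOf(next) == 1;
    }

    // Appends the zxid only if it does not leave a hole; returns false when rejected.
    boolean append(long zxid) {
        Long last = zxids.peekLast();
        if (last != null && !isContinuous(last, zxid)) {
            return false;
        }
        zxids.addLast(zxid);
        return true;
    }

    int size() {
        return zxids.size();
    }
}
```

With this rule, a node that tries to append counter 4 directly after counter 2 in the same epoch is refused, so the hole is never recorded and cannot be served to another node later.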

Refs: ZOOKEEPER-4925, ZOOKEEPER-4394, ZOOKEEPER-3484

Reviewers: li4wang
Author: kezhuw
Closes #2254 from kezhuw/ZOOKEEPER-4925-fix-data-loss

(cherry picked from commit e5dd60bf0512ccc1e090d99410a8da48623219da)

kezhuw avatar Jun 10 '25 17:06 kezhuw

This is the 3.9.4 backport of #2254. I have backported it to branch-3.9. cc @tisonkun

kezhuw avatar Jun 10 '25 17:06 kezhuw

You might want to rebase the patch to re-trigger the failing tests.

anmolnar avatar Jul 21 '25 13:07 anmolnar