ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
This fixes two bugs in the shutdown logic of the ZooKeeper server:
- The `SendAckRequestProcessor` may die when attempting to close its `Learner` owner's socket, to signal that something went wrong, if the learner already closed the socket because something (the same thing) went wrong (namely, the leader disconnecting). This is fixed by simply checking for null first.
- `ZooKeeperServer.shutdown(boolean)` is not overridden in the child classes, so many uses of it fail to properly shut down child resources, such as the `SyncRequestProcessor`. This is fixed by refactoring shutdown for the child classes.
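For illustration, here is a minimal, self-contained sketch of the null-check pattern the first fix relies on. The class and field names below are made up and only stand in for the learner's socket handling; they are not the actual ZooKeeper code.

```java
// Hedged sketch (illustrative names, not the real ZooKeeper classes):
// guard against the peer already having torn down the socket before
// we try to close it to signal a failure.
import java.io.IOException;
import java.net.Socket;

class SocketCloseSketch {
    // Hypothetical stand-in for the learner's socket reference, which may be
    // nulled out concurrently when the learner shuts down on its own.
    private volatile Socket sock;

    void signalFailureByClosingSocket() {
        Socket s = sock;      // read once; the field may become null at any time
        if (s == null) {
            return;           // the learner already closed the connection; nothing to do
        }
        try {
            s.close();
        } catch (IOException e) {
            // Best effort: the socket may already be closed by the other side.
        }
    }
}
```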
A unit test is also added; it fails if either of the two fixes is missing.
To be precise, once the first fix is applied, it fails only because the `SyncRequestProcessor` is never shut down (a thread leak); I didn't spend more time looking for other weird failures that may arise from what is obviously a bug anyway.
See ZOOKEEPER-4541 for full details.
@hanm I believe you reviewed a related PR earlier, so perhaps you're the right reviewer here as well?
I don't know why I can't add @hanm as a reviewer, btw; I hope he will receive the notification.
LGTM
Great work. Did you try out this fix in a test environment?
Not yet, but we may do that before we merge, if you wish. We will find a way to run a patched 3.8.0, probably early next week.
Thanks for the quick reply!
We've had this running for almost a week now, without any issues, and the data inconsistencies have not been observed. The sample size isn't large enough to conclude, though :)
Anyway, we saw some other digest mismatches, and I started digging around for their cause. I found one problem introduced with this commit, fixed in 8121711.
The problem was that a COMMIT between NEWLEADER (which flushes `packetsNotCommitted`) and UPTODATE would crash the learner, which would peek at this queue and expect entries in it. This is fixed by not clearing `packetsNotCommitted` on NEWLEADER; instead, the already-written entries are simply skipped when updating the log after UPTODATE.
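To illustrate the idea (with made-up names, not the actual `Learner` code): after UPTODATE the queued proposals are applied, but any entry whose zxid was already written during sync is skipped rather than logged twice.

```java
// Hedged sketch, assuming a simplified queue of pending proposals; the real
// packet types and log plumbing in ZooKeeper are different.
import java.util.Deque;
import java.util.function.Consumer;

class SyncReplaySketch {
    // Hypothetical stand-in for a queued PROPOSAL packet.
    record QueuedProposal(long zxid, byte[] data) {}

    void applyPending(Deque<QueuedProposal> packetsNotCommitted,
                      long lastLoggedZxid,
                      Consumer<QueuedProposal> logToDisk) {
        QueuedProposal p;
        while ((p = packetsNotCommitted.poll()) != null) {
            if (p.zxid() <= lastLoggedZxid) {
                // Already written to the transaction log during sync; skip to avoid duplicates.
                continue;
            }
            logToDisk.accept(p);
        }
    }
}
```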
Working on the above, I also found the fix for reconfig between NEWLEADER and UPTODATE, in this commit, to be incomplete: since `packetsNotCommitted` is no longer emptied after NEWLEADER, the head doesn't change, and if there are other PROPOSALs between the NEWLEADER and the PROPOSAL that the COMMITANDACTIVATE is meant for, then the reconfig still doesn't happen. The unit test added back then was insufficient to catch this. My fix is in f24eb51, and is to simply traverse `packetsNotCommitted`, looking for a matching zxid. I can't imagine this being a performance issue.
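A rough sketch of what that change boils down to, with illustrative types rather than the real packet classes: scan the whole pending queue for the zxid the COMMITANDACTIVATE targets, instead of assuming it sits at the head.

```java
// Hedged sketch (illustrative names): find the queued proposal matching a zxid.
import java.util.Deque;

class FindPendingProposalSketch {
    // Stand-in for a queued PROPOSAL packet.
    record Pending(long zxid) {}

    static Pending findByZxid(Deque<Pending> packetsNotCommitted, long zxid) {
        for (Pending p : packetsNotCommitted) {
            if (p.zxid() == zxid) {
                return p;      // the proposal the COMMITANDACTIVATE refers to
            }
        }
        return null;           // not queued; the caller decides how to handle this
    }
}
```

Since the pending queue is short during sync, a linear scan costs essentially nothing.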
The symptom of the bug fixed in the first of these two commits is that the learner crashes with an NPE during sync, and then, when it restarts, it typically writes a duplicated series of transactions to its transaction log, complains about that, and later may observe a digest mismatch when replaying that transaction log from file, during startup.
Ah, I think we raced -- I just minted a PR to resolve ZOOKEEPER-4394 which is the problem you described. I'm happy with either solution, but perhaps you can also take a look at my PR?
We raced indeed :) Given that this PR addresses some additional concerns, I'd vote for it to be merged.
Actually, your version is better. I'm incorporating that here instead, if you don't mind? I still think it's a good idea to ensure the pending writes are actually flushed, before ack'ing the NEWLEADER, both because we should ensure they're on persistent storage before ack'ing, and also because that reduces the otherwise random order of ACKs the leader would observe, when SyncRequestProcessor.run races with Learner.syncWithLeader.
Hmm, no, this isn't quite right either, although none of the tests fail. Not sure why the rendezvous-with-sync-thread didn't work, and that's probably the right way to do this. It could be just insufficient test-setup, of course.
Hmm, no, the code has a race as it is now. The LearnerHandler expects the first ACK after starting a DIFF sync to be the NEWLEADER ACK, but if there are lots of PROPOSALs in the diff before the NEWLEADER, then these may also cause an ACK to be sent, which will crash Leader.waitForNewLeaderAck.
Ensuring the transactions are indeed flushed (through the usual request processor pipeline) guarantees these ACKs, and thus always crashes the leader. Meh.
There ... Not pretty, but it seems like the only way to make this right is by making it possible to delay the ACKs that would otherwise be sent once the SyncRequestProcessor flushes, so they arrive after the ACK of the NEWLEADER.
This is implemented by adding special requests that can be enqueued to the sync processor—one to rendezvous with it, from the syncing thread at startup, to ensure TXNs are actually flushed; and two to toggle whether to delay forwarding to the SendAckRequestProcessor, used by the FollowerZooKeeperServer.
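Roughly, the mechanism looks like the sketch below. All names are illustrative and not the real request processor classes: a marker request provides the rendezvous, and the two toggle requests make the flush loop buffer flush-driven ACKs until the NEWLEADER ACK has gone out.

```java
// Hedged sketch of "special requests" flowing through a flush loop; the real
// ZooKeeper processors also persist transactions, which is elided here.
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class DelayedAckSketch {
    /** Rendezvous marker: when dequeued, everything enqueued before it has been handled. */
    static final class FlushMarker {
        final CountDownLatch flushed = new CountDownLatch(1);
    }

    /** Special requests toggling whether flush-driven ACKs are forwarded or buffered. */
    enum Toggle { START_DELAYING_ACKS, STOP_DELAYING_ACKS }

    private final BlockingQueue<Object> requests = new LinkedBlockingQueue<>();
    private final Queue<Object> delayedAcks = new ArrayDeque<>();
    private boolean delayAcks;  // only touched by the flush loop, via Toggle requests

    /** Syncing thread: enqueue a marker and wait until the flush loop reaches it. */
    void rendezvousWithFlush() throws InterruptedException {
        FlushMarker m = new FlushMarker();
        requests.put(m);
        m.flushed.await();
    }

    /** Syncing thread: ask the flush loop to start or stop delaying ACKs. */
    void toggleAckDelaying(Toggle t) throws InterruptedException {
        requests.put(t);
    }

    /** Simplified flush-loop body. */
    void processOne() throws InterruptedException {
        Object item = requests.take();
        if (item instanceof FlushMarker m) {
            m.flushed.countDown();            // earlier txns have now been handled
        } else if (item == Toggle.START_DELAYING_ACKS) {
            delayAcks = true;
        } else if (item == Toggle.STOP_DELAYING_ACKS) {
            delayAcks = false;                // NEWLEADER ACK has been sent; release the rest
            Object ack;
            while ((ack = delayedAcks.poll()) != null) {
                forwardAck(ack);
            }
        } else if (delayAcks) {
            delayedAcks.add(item);            // hold until after the NEWLEADER ACK
        } else {
            forwardAck(item);
        }
    }

    private void forwardAck(Object ack) {
        // In the real code this would hand off to something like SendAckRequestProcessor.
    }
}
```

Because the toggles travel through the same queue as ordinary requests, ordering relative to the flushed transactions is preserved without extra locking.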
FWIW, your recent commit looks good to me.
Another possible approach is to bypass the request processor pipeline entirely like what was done in https://github.com/apache/zookeeper/pull/1848.
Hmm, that other PR doesn't look right to me. It fails to store the TXNs that aren't already committed before ACKing the NEWLEADER, which was what ZOOKEEPER-3911 was all about, in the first place. Agree?
(Pushed again after fixing a complaint from code analysis, about an always true condition.)
We've had this (the last commit, see referencing PR) running across hundreds of clusters for some time now, with more than a thousand rolling restarts concurrently with read and write traffic, and not detected any further issues.
Also, it seems @hanm is no longer active? @eolivelli would you like a different reviewer as well, or are you satisfied? @jeffrey-xiao seems capable, imo :)
Tagging a few reviewers who may be interested in this patch: @anmolnar @lvfangmin @maoling @symat @nkalmar @fpj
I think it will be worth porting this patch to all the active branches once we commit it to the master branch. We should also cut a release.
Hehe, yes, it's too complex. I'll try to extract that sync state into a helper class, with descriptive method names. That should hopefully clarify the meaning of those two queues. And, yes, we need them both. I won't offer an explanation here, but rather hope the next commit makes that superfluous.
Regarding `isPreZAB1_0`: that must have been broken for some time as well—there's no NEWLEADER in the diff there, so we never start the ZK server (after that code was split and moved into the NEWLEADER handling). I can remove it in a separate commit.
So ... you may be right that we don't need to keep all this auxiliary structure during sync, but I believe we need to if we want to precisely preserve today's behaviour (except what's needed to fix those bugs, obviously). What complicates the sync is that some transactions aren't logged and ack'ed; and possibly also that state is used while syncing. I can't say whether the first behaviour is needed, or whether the second is actually the case, without a much deeper dive into all of this, but I do see tests failing when I change this behaviour. Perhaps it is a good idea to first fix these bugs, and then possibly look for ways to simplify?
Any further thoughts on this @breed, @eolivelli ?
I could add that after patching with these commits (not the last one, which should be purely refactoring), we've had zero issues with inconsistent ZK clusters. This is across several hundred thousand rolling cluster restarts. Previously, we typically had one or two broken clusters each week, and had to intervene manually in each case, i.e., a one-in-thousand chance of breaking across a restart.
As another data point, we're also running multiple clusters with weekly restarts and have not seen issues with inconsistent ZK clusters with this patch. I am very interested in getting this merged and was wondering what's left to push this PR through?
IMO, this is a pretty high-priority bug to fix because it requires manual intervention to recover from. Otherwise, the cluster is in a permanently inconsistent state.
As a meta note, I wonder if it's worth splitting this PR into multiple PRs because it fixes distinct bugs (ZOOKEEPER-4409, ZOOKEEPER-4502, ZOOKEEPER-4394, ZOOKEEPER-4541). Likely not worth the effort, but perhaps it will get through review more easily that way ;)
Oof, would probably have been a good idea, but one thing led to another, and now it'd be a lot of work to split this 😬 🙂
@eolivelli any thoughts on what to do next here? I think it would be good to conclude this work soon. It's getting a bit stale :)
> Hmm, that other PR doesn't look right to me. It fails to store the TXNs that aren't already committed before ACKing the NEWLEADER, which was what ZOOKEEPER-3911 was all about, in the first place. Agree?

`// ZOOKEEPER-3911: make sure sync the uncommitted logs before commit them (ACK NEWLEADER).`
Given that, I would say #1445 itself is misleading, or we are misunderstanding the JIRA part of ZOOKEEPER-3911, or both. It should sync only the txns in the DIFF sync; all proposals in this phase are considered committed by the new leader. I think ZOOKEEPER-4394 already/almost made the point: NEWLEADER is not appended immediately/atomically after the proposals in a DIFF sync. I guess we were not aware of this at the time of #1445.
These ongoing (committed or not) proposals are simply beyond this discussion; they belong to the broadcast phase (see Zab, Zab1.0). I consider them a gap between the paper and the implementation. Steps in the paper are mostly atomic, but implementations are not.
Back to this PR: the good part is that it has been verified in production. The bad part is that it is giant and mixes several issues and areas.
- ZOOKEEPER-4409: `NullPointerException` in `SendAckRequestProcessor`.
- ZOOKEEPER-4502: `SyncRequestProcessor` leaks in `ZooKeeperServer::shutdown`.
- Drop pre-Zab1.0. This might make this PR not cherry-pick-able, as it could break clients on 3.7.x and 3.8.x. Though, I think it was probably broken by ZOOKEEPER-3104 long before.
- `DIFF` syncs are not persisted. There are several issues and PRs:
  - ZOOKEEPER-3023: Flaky test `Zab1_0Test.testNormalFollowerRunWithDiff`. This predates ZOOKEEPER-3911; basically we can consider #1848 a fix for ZOOKEEPER-2678.
  - ZOOKEEPER-3911: Data inconsistency caused by DIFF sync uncommitted log. #1445 tried, but failed, so here we are.
  - ZOOKEEPER-4394: `Learner.syncWithLeader` got a `NullPointerException`. I think this is a symptom of #1445 and the lagging `NEWLEADER`. Targeting a fix that merely avoids the `NullPointerException` is going in the wrong direction, from my point of view. #1930 will fix the `NullPointerException`, but no more than that.
  - ZOOKEEPER-4646 (Committed txns may still be lost if followers crash after replying ACK of NEWLEADER but before writing txns to disk) and ZOOKEEPER-4685 (Unnecessary system unavailability due to Leader shutdown when follower sent ACK of PROPOSAL before sending ACK of NEWLEADER in log recovery). I was impressed by the analysis. #1993 is the ongoing effort, but I think it was misled by #1445 and hence does not fix ZOOKEEPER-4394.
  - This PR. To be honest, I don't like this part of this PR: it ties the `synchronization` and `broadcast` phases together. Personally, I prefer a synchronous approach as #1848 and #1993 adopted; I think they are clean and straightforward, though they have cons/drawbacks, as #1993 pointed out. In the long term, I would suggest sending `NEWLEADER` just after the `DIFF` sync, which aligns with the paper. But it won't fix this persistence issue.
- Refactoring. I like this part, but it may increase the review burden.
Personally, I would suggest fixing the above issues separately. There are other possible issues in the synchronization phase:
- ZOOKEEPER-4643: Committed txns may be improperly truncated if a follower crashes right after updating currentEpoch but before persisting txns to disk.
- `NEWLEADER` is not sent immediately after the `DIFF` sync. This diverges from the paper. ZOOKEEPER-4394 pointed this out.
- It is not guaranteed/asserted in `LearnerHandler` that the first `ACK` targets `NEWLEADER`. ZOOKEEPER-4685 pointed this out.
- Unmatched commits cause `log.warn`s in the synchronization phase, while they cause a system exit in `FollowerZooKeeperServer.commit`.
> IMO, this is a pretty high-priority bug to fix because it requires manual intervention to recover from. Otherwise, the cluster is in a permanently inconsistent state.

> I think it would be good to conclude this work soon. It's getting a bit stale :)
Hmm, some of my PRs, #1820 (merged), #1859 (approved), and #1847, are all almost one year old. I also opened apache/bookkeeper#3041 (data loss, 1 year), apache/pulsar#7490 (data duplication, 3 years), and google/leveldb#375 (data inconsistency, 3 years until google/leveldb#339 got merged). I guess we should believe in time 😮💨.
> Oof, would probably have been a good idea, but one thing led to another, and now it'd be a lot of work to split this 😬 🙂
Maybe we can start from fresh fixes? Anyway, it may not be a pleasant process 😨 😵💫. I believe keeping a PR focused helps it get merged. People might fear a giant PR; at least, I hesitated (partly because of this) to get involved until today.
Could you please take a look at this and the alternatives #1848 and #1993? @eolivelli @breed @cnauroth @hanm @nkalmar @ztzg @anmolnar @tisonkun @li4wang
Is this PR officially getting superseded by #2152 and #2154?
Not for me to decide, but I'd think so.
Superseded by #2111, #2152 and #2154.