zookeeper
zookeeper copied to clipboard
Jonmv/zookeeper 4541 take 2
This PR fixes the shutdown errors that were added in #157, and also avoids a common NPE during ZK shutdown from a learner, when the leader shuts down (first commit). Together with #2111 and #2152, this should cover all the fixes in #1925.
We've had the forked ZK from #1925 running embedded in hundreds, if not thousands, of ZK clusters, with rolling restarts most days, and we've had zero cases of inconsistent data since we patched—one or a few cases per week before that.
(We still sometimes see ephemeral nodes remain after the leader is brutally taken down, i.e., with Runtime.halt()
, but this looks different; it seems clearing out client sessions, and their ephemeral nodes, simply isn't done when death is too sudden.)
LGTM.
Together with #2111 and #2152, this should cover all the fixes in #1925.
Yes I believe so.
BTW, could u please recommit this pr, since it wasn't built successfully.
Will this fix be backported in the 3.8 train? We just hit this bug on one of our clusters, it's a shame we've had various fixes up for review for over 20 months, I would appreciate you guys pushing this through the finish line so this critical issue can be closed. Thanks!
One more thing, the resources (Threads/Processors...) created in startupWithServerState(State.INITIAL)
won't be released in shutdown
, cause of canShutdown
does not contain the condition state == State.INITIAL
. This LEAK would occur before ZooKeeperServer.state
changes to State.RUNNING
(follower read a UPTODATE
packet).
This is a pretty small, targeted fix now, is there anything controversial about it or would it be possible to merge it and cut a release soon? The change needs to be rebased.
I reworked the Learner-socket-close code to be a bit less prone to misuse, while handling a conflict with the first commit, just now.
https://github.com/apache/zookeeper/commit/d2ee4dd050174c24c9f645102072ceaccd25f0fd fixes the last comment by @changruill.
Hello, would it be possible to move forward with this PR and merge it and cut a release? Are there any outstanding concerns or objections?
@jonmv @tsuna Please close any outstanding PR/Jira ticket that you think is already superseded by something else. Also please follow the e-mail thread on the dev list which tries to wrap up what's needed for the release. Thanks.
Hi @jonmv, I added a test case for this pr in jonmv/zookeeper#1. Could you please take a look ?
A seemingly unrelated unit test failed in the PR jenkins job: it sleeps for a while, while waiting for some state that wasn't reached within timeout; it doesn't depend on shutdown logic, and the test also runs fine locally.
Merged. Thanks @jonmv !
@jonmv What is your Jira id?
@jonmv What is your Jira id?
It's also "jonmv".
@jonmv What is your Jira id?
It's also "jonmv".
All done. Thank you!
Hey @jonmv thanks for the fix. Can https://issues.apache.org/jira/browse/ZOOKEEPER-4502 (referenced in https://github.com/apache/zookeeper/pull/1925) be closed now that this is merged?
I have closed ZOOKEEPER-4502.
@bbarnes52 Thank you for your information!