zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

Jonmv/zookeeper 4541 take 2

Open jonmv opened this issue 10 months ago • 6 comments

This PR fixes the shutdown errors that were added in #157, and also avoids a common NPE during ZK shutdown from a learner, when the leader shuts down (first commit). Together with #2111 and #2152, this should cover all the fixes in #1925.

We've had the forked ZK from #1925 running embedded in hundreds, if not thousands, of ZK clusters, with rolling restarts most days, and we've had zero cases of inconsistent data since we patched—one or a few cases per week before that. (We still sometimes see ephemeral nodes remain after the leader is brutally taken down, i.e., with Runtime.halt(), but this looks different; it seems clearing out client sessions, and their ephemeral nodes, simply isn't done when death is too sudden.)

jonmv avatar Apr 04 '24 11:04 jonmv

LGTM.

Together with #2111 and #2152, this should cover all the fixes in #1925.

Yes I believe so.

BTW, could u please recommit this pr, since it wasn't built successfully.

AlphaCanisMajoris avatar Apr 10 '24 13:04 AlphaCanisMajoris

Will this fix be backported in the 3.8 train? We just hit this bug on one of our clusters, it's a shame we've had various fixes up for review for over 20 months, I would appreciate you guys pushing this through the finish line so this critical issue can be closed. Thanks!

tsuna avatar Jun 13 '24 12:06 tsuna

One more thing, the resources (Threads/Processors...) created in startupWithServerState(State.INITIAL) won't be released in shutdown, cause of canShutdown does not contain the condition state == State.INITIAL. This LEAK would occur before ZooKeeperServer.state changes to State.RUNNING (follower read a UPTODATE packet).

changruill avatar Jun 14 '24 05:06 changruill

This is a pretty small, targeted fix now, is there anything controversial about it or would it be possible to merge it and cut a release soon? The change needs to be rebased.

tsuna avatar Jun 19 '24 21:06 tsuna

I reworked the Learner-socket-close code to be a bit less prone to misuse, while handling a conflict with the first commit, just now.

jonmv avatar Aug 12 '24 12:08 jonmv

https://github.com/apache/zookeeper/commit/d2ee4dd050174c24c9f645102072ceaccd25f0fd fixes the last comment by @changruill.

jonmv avatar Aug 28 '24 09:08 jonmv

Hello, would it be possible to move forward with this PR and merge it and cut a release? Are there any outstanding concerns or objections?

tsuna avatar Sep 18 '24 11:09 tsuna

@jonmv @tsuna Please close any outstanding PR/Jira ticket that you think is already superseded by something else. Also please follow the e-mail thread on the dev list which tries to wrap up what's needed for the release. Thanks.

anmolnar avatar Sep 18 '24 18:09 anmolnar

Hi @jonmv, I added a test case for this pr in jonmv/zookeeper#1. Could you please take a look ?

kezhuw avatar Sep 19 '24 14:09 kezhuw

A seemingly unrelated unit test failed in the PR jenkins job: it sleeps for a while, while waiting for some state that wasn't reached within timeout; it doesn't depend on shutdown logic, and the test also runs fine locally.

jonmv avatar Sep 20 '24 11:09 jonmv

Merged. Thanks @jonmv !

anmolnar avatar Sep 20 '24 19:09 anmolnar

@jonmv What is your Jira id?

anmolnar avatar Sep 20 '24 19:09 anmolnar

@jonmv What is your Jira id?

It's also "jonmv".

jonmv avatar Sep 21 '24 16:09 jonmv

@jonmv What is your Jira id?

It's also "jonmv".

All done. Thank you!

anmolnar avatar Sep 23 '24 22:09 anmolnar

Hey @jonmv thanks for the fix. Can https://issues.apache.org/jira/browse/ZOOKEEPER-4502 (referenced in https://github.com/apache/zookeeper/pull/1925) be closed now that this is merged?

bbarnes52 avatar Oct 28 '24 23:10 bbarnes52

I have closed ZOOKEEPER-4502.

@bbarnes52 Thank you for your information!

kezhuw avatar Oct 29 '24 02:10 kezhuw