zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

ZOOKEEPER-4400 Zookeeper not getting Graceful Termination

Open luke-sterkowicz opened this issue 2 years ago • 13 comments

Register shutdown hook in order for zookeeper server to respond to SIGTERM signals and shut down gracefully. Introduce registerShutdownHook config flag (defaults to false) for backward compatibility.

luke-sterkowicz avatar Jun 29 '22 10:06 luke-sterkowicz

@symat could you please restart the failing job? I'm pretty sure the failure is accidental - the tests passed locally, the failing test is timeout based and failed only on 3rd instance (out of 3).

luke-sterkowicz avatar Jul 11 '22 21:07 luke-sterkowicz

Thank you @luke-sterkowicz for the contribution!!

I re-triggered the CI, but don't worry about it (you were hitting a known flaky test, it is unrelated to your PR)

Introduce registerShutdownHook config flag (defaults to false) for backward compatibility.

I'm not totally sure this parameter is required. Backward compatibility is important, but I don't think anyone relied on the fact that ZooKeeper haven't responded to SIGTERM signals before. @eolivelli , @anmolnar , what do you think?

Anyway, if we introduce the parameter, then please document it in the admin guide: https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

Thanks!!

symat avatar Jul 12 '22 10:07 symat

@symat probably, although the shutdown will take slightly longer (in rare cases may be long enough for some people to consider it a regression), so I just wanted to be on a safe side. Let me know your decision, I'm ok both ways (removing the flag or updating the guide).

luke-sterkowicz avatar Jul 13 '22 19:07 luke-sterkowicz

shutdown will take slightly longer (in rare cases may be long enough for some people to consider it a regression)

Thanks! I think this is a good reason to disable it by default. Also worth mention it in the documentation. Sorry for the late answer, I was on vacation.

symat avatar Jul 19 '22 08:07 symat

Also worth mention it in the documentation. Sorry for the late answer, I was on vacation.

No worries! I pushed a small fix to align with other boolean parsing and updated 'advanced config' section in admin guide. I didn't add "New in X.x" info as I guess this is yet TBD.

luke-sterkowicz avatar Jul 19 '22 10:07 luke-sterkowicz

I didn't add "New in X.x" info as I guess this is yet TBD.

please add "New in 3.9.0" to this PR, as it was opened against the master branch. If we backport this to other branches (e.g. branch-3.8, branch-3.7), then we can change this in the given backport PRs. But usually we only backport bugfixes and security fixes to older branches.

symat avatar Jul 19 '22 12:07 symat

please add "New in 3.9.0" to this PR

okay, added

luke-sterkowicz avatar Jul 19 '22 12:07 luke-sterkowicz

you can never expect that the graceful shutdown works. it adds lots of races. it is better that we guarantee that ZooKeeper works well with KILL -9

eolivelli avatar Jul 19 '22 14:07 eolivelli

you can never expect that the graceful shutdown works. it adds lots of races.

That's surprising because I believe during leader election zk does an equivalent of graceful shutdown internally, no?

it is better that we guarantee that ZooKeeper works well with KILL -9

There's zk and there's its clients and environment they operate in. In most cases whenever there's SIGTERM, there's also following SIGKILL after a timeout (if SIGTERM fails). So even if graceful shutdown stucks (as you claim), that shouldn't be a problem really and the parent/operator at least has a chance to try.

Lack of this provided us a lots of problems in a k8s env with LB on the front of zk. And while LB may not be the most fortunate solution for zk, there will be justifiable scenarios from time to time, and people will face problems and incidents from time to time (like the original reporter of ZOOKEEPER-4400).

luke-sterkowicz avatar Jul 19 '22 20:07 luke-sterkowicz

@eolivelli

This is working like that by design. why do you need such behaviour ?

I remember that @phunt once said "Kill -9 is a feature"

Why would we want this to be the normal case though? Graceful shutdown of connections allows us to avoid dangling connections and make normal maintenance operations less likely to cause issues. Even zkServer.sh stop just sends a kill and doesn't cause a graceful shutdown.

you can never expect that the graceful shutdown works. it adds lots of races. it is better that we guarantee that ZooKeeper works well with KILL -9

There is a graceful shutdown implemented in Zookeeper, it's just not triggered through a signal. Why would this handler introduce additional races if it triggers the same path?

behos avatar Jun 08 '23 11:06 behos

I think this patch is useful. If I understand correctly it doesn't alter the existing behaviour, so ZooKeeper will do exactly the same when killed with 'kill -9'. Instead, it just adds an additional option to gracefully shutdown when received SIGTERM.

If it helps running in k8s environment as @luke-sterkowicz mentions, why would we reject it?

it is better that we guarantee that ZooKeeper works well with KILL -9

We still do @eolivelli , don't we?

anmolnar avatar Jun 11 '23 18:06 anmolnar

@luke-sterkowicz As the patch doesn't harm the existing logic, why to put this behind a config option?

anmolnar avatar Jun 11 '23 18:06 anmolnar