zookeeper
zookeeper copied to clipboard
ZOOKEEPER-4400 Zookeeper not getting Graceful Termination
Register shutdown hook in order for zookeeper server to respond to SIGTERM signals and shut down gracefully. Introduce registerShutdownHook
config flag (defaults to false
) for backward compatibility.
@symat could you please restart the failing job? I'm pretty sure the failure is accidental - the tests passed locally, the failing test is timeout based and failed only on 3rd instance (out of 3).
Thank you @luke-sterkowicz for the contribution!!
I re-triggered the CI, but don't worry about it (you were hitting a known flaky test, it is unrelated to your PR)
Introduce registerShutdownHook config flag (defaults to false) for backward compatibility.
I'm not totally sure this parameter is required. Backward compatibility is important, but I don't think anyone relied on the fact that ZooKeeper haven't responded to SIGTERM signals before. @eolivelli , @anmolnar , what do you think?
Anyway, if we introduce the parameter, then please document it in the admin guide: https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
Thanks!!
@symat probably, although the shutdown will take slightly longer (in rare cases may be long enough for some people to consider it a regression), so I just wanted to be on a safe side. Let me know your decision, I'm ok both ways (removing the flag or updating the guide).
shutdown will take slightly longer (in rare cases may be long enough for some people to consider it a regression)
Thanks! I think this is a good reason to disable it by default. Also worth mention it in the documentation. Sorry for the late answer, I was on vacation.
Also worth mention it in the documentation. Sorry for the late answer, I was on vacation.
No worries! I pushed a small fix to align with other boolean parsing and updated 'advanced config' section in admin guide. I didn't add "New in X.x" info as I guess this is yet TBD.
I didn't add "New in X.x" info as I guess this is yet TBD.
please add "New in 3.9.0" to this PR, as it was opened against the master branch. If we backport this to other branches (e.g. branch-3.8, branch-3.7), then we can change this in the given backport PRs. But usually we only backport bugfixes and security fixes to older branches.
please add "New in 3.9.0" to this PR
okay, added
you can never expect that the graceful shutdown works. it adds lots of races. it is better that we guarantee that ZooKeeper works well with KILL -9
you can never expect that the graceful shutdown works. it adds lots of races.
That's surprising because I believe during leader election zk does an equivalent of graceful shutdown internally, no?
it is better that we guarantee that ZooKeeper works well with KILL -9
There's zk and there's its clients and environment they operate in. In most cases whenever there's SIGTERM, there's also following SIGKILL after a timeout (if SIGTERM fails). So even if graceful shutdown stucks (as you claim), that shouldn't be a problem really and the parent/operator at least has a chance to try.
Lack of this provided us a lots of problems in a k8s env with LB on the front of zk. And while LB may not be the most fortunate solution for zk, there will be justifiable scenarios from time to time, and people will face problems and incidents from time to time (like the original reporter of ZOOKEEPER-4400).
@eolivelli
This is working like that by design. why do you need such behaviour ?
I remember that @phunt once said "Kill -9 is a feature"
Why would we want this to be the normal case though? Graceful shutdown of connections allows us to avoid dangling connections and make normal maintenance operations less likely to cause issues. Even zkServer.sh stop
just sends a kill
and doesn't cause a graceful shutdown.
you can never expect that the graceful shutdown works. it adds lots of races. it is better that we guarantee that ZooKeeper works well with KILL -9
There is a graceful shutdown implemented in Zookeeper, it's just not triggered through a signal. Why would this handler introduce additional races if it triggers the same path?
I think this patch is useful. If I understand correctly it doesn't alter the existing behaviour, so ZooKeeper will do exactly the same when killed with 'kill -9'. Instead, it just adds an additional option to gracefully shutdown when received SIGTERM.
If it helps running in k8s environment as @luke-sterkowicz mentions, why would we reject it?
it is better that we guarantee that ZooKeeper works well with KILL -9
We still do @eolivelli , don't we?
@luke-sterkowicz As the patch doesn't harm the existing logic, why to put this behind a config option?