
Jenkins 31084 graceful shutdown

Open akomakom opened this issue 5 years ago • 2 comments

Work in progress. Seems to be working so far.

Question: should doGetSlaveReadyForShutdown also check secrets/permissions?

akomakom • Jun 05 '19 20:06

This looks great so far! I like the use of a ShutdownHook, and the backend APIs are pretty much exactly what I expected.

Have you given any thought as to how this functionality might be activated or deactivated? The Java API states:

Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the virtual machine is terminated due to user logoff or system shutdown the underlying operating system may only allow a fixed amount of time in which to shut down and exit. It is therefore inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.

For this reason as well as to preserve backwards compatibility of the existing command-line API, I think this new functionality should probably be opt-in rather than opt-out by default. What do you think?

I have mixed feelings about the design. First off, the expected use case is the opposite of "finish work quickly". I have jobs that take 8 hours, which is how long the ShutdownHook would have to delay VM termination.
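To make the concern concrete, here is a minimal sketch of a shutdown hook that holds the JVM open until work is done. It is not the code in this PR: the class name, markIdle() and awaitIdle() are hypothetical, and the timeout is only there to show that the hook has to block for as long as the longest job.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from the swarm client.
final class GracefulShutdownSketch {
    // Released once the (hypothetical) job runner reports that nothing is running.
    private final CountDownLatch idle = new CountDownLatch(1);

    /** Called by the job runner when the last running build has finished. */
    void markIdle() {
        idle.countDown();
    }

    /** Blocks until idle or until the timeout expires; returns true if idle was reached. */
    boolean awaitIdle(long timeout, TimeUnit unit) {
        try {
            return idle.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /**
     * Registers the hook. On SIGINT/SIGTERM or System.exit the JVM does not
     * finish exiting until this thread returns, which is exactly the
     * long-running behavior the Java API advises against.
     */
    Thread install(long timeout, TimeUnit unit) {
        Thread hook = new Thread(() -> awaitIdle(timeout, unit), "graceful-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

With an 8-hour job this would mean something like install(8, TimeUnit.HOURS), and a kill -9 (or an OS-level shutdown deadline) would still cut it short.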

Then there is the opt-in. I see two main approaches:

  1. A new flag or signal to enable graceful shutdown behavior.
  2. A second kill required to terminate with prejudice.

Number 2 is not really opt-in since it will require instrumentation changes. It's also not trivial to do using only Java's shutdown hooks (without explicitly registering signal hooks), but we might as well require kill -9 the second time to make it easy.

Which leaves these options:

  1. -gracefulShutdown=true changes SIGINT behavior to what I have in this PR. This still violates the "quick" directive for shutdown hooks.
  2. kill -SIGUSR1 (for example) as opposed to a SIGINT/SIGTERM which still kill instantly. (But I have yet to find a signal handling approach that doesn't rely on sun.misc.*)
  3. Some other mechanism, e.g. an HTTP listener, a file monitor, etc.

I'm in favor of choice 3 because I am not confident about the suitability of shutdown hooks for this use case. An HTTP listener is probably too heavy and opens some security vulnerabilities, but a command file (a la nagios.cmd) would work, e.g. -commandFile /some/path. Then echo 'gracefulShutdown' > /some/path initiates shutdown (the details of the file can also be hidden by Java, i.e. java -jar swarm-client.jar gracefulShutdown). Maybe even 2 files, command and status, so that the new process can request shutdown and monitor the results, blocking until ready. The status file could also be used for monitoring by external tools.
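A rough sketch of what the command-file idea could look like on the client side. Everything here is an assumption for illustration: the class and method names are made up, the file is treated as a regular file that gets polled (not a named pipe like nagios.cmd), and the status format is just a single line of text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; no such class exists in the swarm client.
final class CommandFileSketch {
    // The command text from the example: echo 'gracefulShutdown' > /some/path
    static final String SHUTDOWN_COMMAND = "gracefulShutdown";

    /** True if any line of the command file, ignoring surrounding whitespace, is the shutdown command. */
    static boolean requestsShutdown(List<String> lines) {
        return lines.stream().map(String::trim).anyMatch(SHUTDOWN_COMMAND::equals);
    }

    /** One polling step: a missing file simply means no command has been issued yet. */
    static boolean poll(Path commandFile) {
        if (!Files.isRegularFile(commandFile)) {
            return false;
        }
        try {
            return requestsShutdown(Files.readAllLines(commandFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Overwrites the status file with one line, so another process or a monitoring tool can read it. */
    static void writeStatus(Path statusFile, String status) {
        try {
            Files.write(statusFile, (status + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The client would call poll() on a timer and, once it returns true, stop accepting work and call writeStatus() as it drains. The requesting process would then block by re-reading the status file until it reports ready.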

akomakom • Jun 06 '19 17:06

Another half-baked idea:

  • Run the jar to shut down the other process, i.e. java -jar swarm-client.jar .... -gracefulShutdown (same command line + shutdown switch)
  • It communicates with Jenkins
  • Jenkins marks the original slave offline
  • The original instance determines that it should shut down by also communicating with Jenkins (by polling? by piggybacking on some existing channel?) and exits once jobs finish.

This might not work with unique client names.

Alternatively:

  • Run the jar as above, so that it:
  • Communicates to Jenkins, marks old slave offline.
  • Blocks until jobs finish.
  • Kills the original process.

Again, unique client names make it difficult to find the right one to work with if multiple instances are present.
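For the second variant, the "block until jobs finish, then kill" part could be sketched roughly as below. This is only an illustration under assumptions: the busy check is passed in as a BooleanSupplier because how to ask Jenkins whether the node is still running jobs is left open here, the class and method names are hypothetical, and ProcessHandle needs Java 9 or newer.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; the busy check stands in for whatever
// query against Jenkins ends up being used.
final class DrainAndKillSketch {
    /**
     * Polls until the node is no longer busy or maxPolls is exhausted.
     * Returns true if the node became idle, false on timeout or interrupt.
     */
    static boolean waitUntilIdle(BooleanSupplier busy, long pollMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (!busy.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return !busy.getAsBoolean();
    }

    /**
     * Requests normal termination of the original client process.
     * Returns false if no such process exists or the request failed.
     */
    static boolean terminate(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::destroy).orElse(false);
    }
}
```

This still assumes the second process already knows the pid of the right original instance, which is the same unique-client-name problem in a different form.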

akomakom • Jun 06 '19 18:06