dask-gateway
Gateway fails to shutdown if kernel restarted while the graph is executing
I've encountered this situation a few times now: right after I start a computation, I realize my worker size is too small and the task graph will fail, so I instinctively restart my Jupyter kernel, up the worker size, and create a new cluster (on GKE with dask-gateway). If I restart the kernel while the graph is executing, the cluster I had created is not auto-killed, I suspect because the client can't signal to the gateway that it should kill the cluster. I think it's a fair assumption that users will restart their kernel during a computation, so we likely need a pattern for how best to handle this. Any thoughts on how this condition should be handled?
How are you restarting your kernel? Do you just hit "restart kernel"? How long does this take? I'm wondering if Jupyter is force-killing the kernel since graceful shutdown takes too long, so our cleanup handlers don't have time to run.
Pressing `00` in the notebook interface to make the restart-kernel UI element pop up. Then it restarts the kernel basically immediately.
tl;dr: This has nothing to do with dask-gateway; it's a consequence of how Jupyter implements kernel restarts. A practical workaround is to interrupt the kernel before restarting. We could make our code more complicated to handle this, but I'm not sure it's worth it.
Hmmm, I've tracked this down to how Jupyter Notebook restarts kernels. AFAICT when a kernel is restarted, it doesn't interrupt any currently running code. It starts the shutdown process, waits a short period, and if the kernel hasn't fully shut down it sends `SIGKILL`. This means it's quite easy for `atexit` handlers to never run if executing code is blocking them.
Here's a reproducer that doesn't use dask (paste the following into a notebook cell, run it, then restart before the cell completes). If the `atexit` handler ran, it would delete the created file. As is, the file isn't deleted.
```python
import atexit
import os
import time

with open("temp-file", "w") as f:
    f.write("This file should be deleted by the exit handler")

@atexit.register
def exit_handler():
    os.unlink("temp-file")

time.sleep(1000)

# Restart the kernel: `temp-file` should be deleted but isn't.
# If you interrupt, then restart, `temp-file` is deleted.
```
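The same asymmetry can be reproduced outside Jupyter entirely. Below is a small sketch (plain CPython on Unix, no Jupyter or dask involved; the `file_leaked_after` helper and the `CHILD` script are mine, purely for illustration) that runs a child which registers an `atexit` handler, then kills it two ways: `SIGINT` lets the handler run, while `SIGKILL` skips it, mirroring what Notebook's forced restart does.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child process: register an atexit handler that deletes a file, print a
# marker so the parent knows the handler is registered, then block.
CHILD = """\
import atexit, os, sys, time
@atexit.register
def cleanup():
    os.unlink(sys.argv[1])
print("ready", flush=True)
time.sleep(1000)
"""

def file_leaked_after(sig):
    """Run the child, send it `sig`, and return True if the file survived."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, path], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # wait for "ready": the handler is registered
    proc.send_signal(sig)
    proc.wait()
    leaked = os.path.exists(path)
    if leaked:
        os.unlink(path)
    return leaked
```

With `SIGINT`, `time.sleep` is interrupted by `KeyboardInterrupt`, the interpreter shuts down normally, and the `atexit` handler deletes the file; with `SIGKILL`, the process dies immediately and the file leaks.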
I'm not sure what to do here. For Python scripts, terminal usage, etc., the normal signal-handling behavior is usually sufficient: SIGINT -> handlers run cleanup -> process exits. We could make this more robust by adding an intermediate watcher process that waits for the parent to exit and then notifies the gateway, but I'd rather avoid that if possible.
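For reference, the watcher-process idea could look something like the sketch below. This is an assumption about how it might be wired up, not anything dask-gateway actually does: the parent holds the write end of a pipe, and when the parent dies (even via `SIGKILL`, since the OS closes its file descriptors regardless), the watcher's blocking read returns and it performs the cleanup. Here the cleanup is just deleting a file; a real implementation would contact the gateway to stop the cluster instead.

```python
import subprocess
import sys

# Hypothetical watcher body: block until the parent's end of the pipe
# closes (which the OS guarantees on any parent exit, including SIGKILL),
# then do the cleanup the parent's atexit handler would have done.
WATCHER = """\
import os, sys
sys.stdin.read()          # returns only once the parent is gone
os.unlink(sys.argv[1])    # stand-in for "tell the gateway to stop the cluster"
"""

def spawn_cleanup_watcher(path):
    # start_new_session so the watcher survives signals sent to the
    # parent's process group (e.g. Ctrl-C in a terminal).
    return subprocess.Popen(
        [sys.executable, "-c", WATCHER, path],
        stdin=subprocess.PIPE,
        start_new_session=True,
    )
```

Closing the watcher's stdin simulates the parent exiting: the watcher's read returns and it removes the file. The appeal is that this works no matter how the parent dies; the cost is an extra process per client, which is the complexity I'd rather avoid.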
Filed an upstream issue: https://github.com/ipython/ipykernel/issues/462