
Gateway fails to shutdown if kernel restarted while the graph is executing

Open ericdill opened this issue 5 years ago • 4 comments

I've encountered this situation a few times now: I realize right after I start a computation that my worker size is too small and I know the task graph will fail. So I instinctively restart my Jupyter kernel, increase the worker size, and create a new cluster (on GKE with dask-gateway). If I restart the kernel while the graph is executing, the cluster that I had created is not auto-killed, I suspect because the client can't signal to the gateway that it should kill the cluster. I think it's a fair assumption that users will restart their kernel during a computation, so we likely need a pattern for how best to handle this. Any thoughts on how this condition should be handled?

ericdill avatar Oct 15 '19 15:10 ericdill

How are you restarting your kernel? Do you just hit "restart kernel"? How long does this take? I'm wondering if Jupyter is force-killing the kernel since graceful shutdown takes too long, so our cleanup handlers don't have time to run.

jcrist avatar Oct 15 '19 15:10 jcrist

Pressing 00 in the notebook interface to make the restart-kernel dialog pop up. Then it restarts the kernel basically immediately.

ericdill avatar Oct 15 '19 15:10 ericdill

tl;dr: This has nothing to do with dask-gateway; it comes down to how Jupyter implements kernel restarts. A workaround in the meantime is to interrupt the kernel before restarting. We could make our code more complicated to handle this, but I'm not sure it's worth it.


Hmmm, I've tracked this down to how Jupyter Notebook restarts kernels. AFAICT, when a kernel is restarted, it doesn't interrupt any currently running code. It starts the shutdown process, waits a short period, and if the kernel hasn't fully shut down it sends SIGKILL. This means it's quite easy for atexit handlers to never run if executing code is blocking them.

Here's a reproducer that doesn't use dask (paste the following into a notebook cell, run it, then restart the kernel before the cell completes). If the atexit handler ran, it would delete the created file. As is, the file isn't deleted.

import atexit
import os
import time

# Create a file that the exit handler is responsible for removing.
with open("temp-file", "w") as f:
    f.write("This file should be deleted by the exit handler")

# Register cleanup to run when the interpreter exits normally.
@atexit.register
def exit_handler():
    os.unlink("temp-file")

time.sleep(1000)

# Restart the kernel: `temp-file` should be deleted but isn't.
# If you interrupt the kernel and then restart, `temp-file` is deleted.
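The same distinction can be checked outside a notebook. The sketch below (assuming a POSIX system where `sys.executable` is a working Python) runs a child that registers an atexit handler and then blocks: after SIGINT the unhandled KeyboardInterrupt still exits the interpreter normally, so the handler runs; after SIGKILL it never does.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Child script: creates a marker file, registers an atexit handler
# that removes it, then blocks. The marker path is passed as argv[1].
CHILD = textwrap.dedent("""
    import atexit, os, sys, time
    path = sys.argv[1]
    open(path, "w").close()
    atexit.register(os.unlink, path)
    time.sleep(1000)
""")

def run_and_kill(sig):
    """Start the child, send it `sig`, report whether the marker survived."""
    path = tempfile.mktemp()
    proc = subprocess.Popen([sys.executable, "-c", CHILD, path])
    time.sleep(1.0)          # give the child time to create the file
    proc.send_signal(sig)
    proc.wait()
    survived = os.path.exists(path)
    if survived:
        os.unlink(path)
    return survived

print("marker survives SIGINT: ", run_and_kill(signal.SIGINT))   # False
print("marker survives SIGKILL:", run_and_kill(signal.SIGKILL))  # True
```

This mirrors the notebook behavior above: "interrupt then restart" is the SIGINT path, while a plain restart on a busy kernel ends in the SIGKILL path.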

I'm not sure what to do here. For Python scripts, terminal usage, etc., the normal signal-handling behavior is usually sufficient: SIGINT -> handlers clean up -> process exits. We could make this more robust by adding an intermediate watcher process that waits for the parent to exit and then notifies the gateway, but I'd rather avoid that if possible.

jcrist avatar Dec 03 '19 21:12 jcrist

Filed an upstream issue: https://github.com/ipython/ipykernel/issues/462

jcrist avatar Dec 03 '19 21:12 jcrist