Add option to shutdown cluster if client disconnects after a timeout
When a `GatewayCluster` object is cleaned up, by default it will shut down the associated cluster. In a perfect scenario this works fine, but because this requires an external API call and is implemented using `__del__` (with `atexit` as a fallback), it comes with a few caveats:
- It can't work if the interpreter is killed (there's no way to register a signal handler for SIGKILL). It also currently won't work with SIGTERM, but we could register a signal handler for that. I'm a bit hesitant to (since that's a global change to the interpreter), but it could be done. One case where this comes up periodically is if the user restarts their notebook kernel while it's computing something. In this case the kernel is terminated in a non-graceful manner and the atexit handlers don't run (see https://github.com/dask/dask-gateway/issues/155).
- It can't work in the presence of segfaults.
- It can't work in the presence of network failures.
In the common case, though (a normal Python shutdown that is allowed to complete), the existing mechanism works. Other dask cluster managers that support external schedulers (dask-yarn, dask-kubernetes) have the same problems; this isn't specific to dask-gateway.
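For illustration, the general pattern looks roughly like the following. This is a minimal sketch of the `__del__`/`atexit` approach described above, not dask-gateway's actual implementation; anything that bypasses normal interpreter shutdown skips both hooks.

```python
import atexit
import weakref

# Sketch of the __del__/atexit cleanup pattern; NOT dask-gateway's real code.
_active_clusters = weakref.WeakSet()


def _cleanup_all():
    # Runs only on a normal interpreter exit; never on SIGKILL, a segfault,
    # or a hard (non-graceful) kernel restart.
    for cluster in list(_active_clusters):
        cluster.shutdown()


atexit.register(_cleanup_all)


class ManagedCluster:
    def __init__(self):
        self._closed = False
        _active_clusters.add(self)

    def shutdown(self):
        if self._closed:
            return
        self._closed = True
        # In dask-gateway this step is an external API call to the gateway
        # server; if the network is unavailable here, the cluster lingers.
        print("shutting down cluster")
        _active_clusters.discard(self)

    def __del__(self):
        # Best-effort cleanup when the object is garbage collected.
        self.shutdown()
```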
I expect the failure mode here to be uncommon, but to still occur often enough that we'll want a way to handle it. We currently support shutting down idle clusters after a timeout (with `ClusterConfig.idle_timeout`), but this also affects clusters with active connections (i.e. if you sit and think for longer than this timeout, your cluster will shut down).
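For reference, that existing knob is configured on the gateway server side; the file name and value below are just illustrative:

```python
# dask_gateway_config.py (gateway server side); value is illustrative.
# Shut down any cluster after one hour without activity, even if an idle
# client is still connected.
c.ClusterConfig.idle_timeout = 3600  # seconds
```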
A different option might be to add another timeout for disconnect of all external clients. This could be set to a smaller value (perhaps 30-120s seems reasonable) to catch disconnects earlier. This would work in all cases where the client session has closed.
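If we added such an option, the user-facing side might look something like the following. The option name is hypothetical; nothing like it exists today.

```python
# dask_gateway_config.py; HYPOTHETICAL option, not part of dask-gateway.
# Shut down a cluster once no external client has been connected for 60s.
c.ClusterConfig.client_disconnect_timeout = 60  # seconds
```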
There are some logic decisions to make here:
- Do we shut down only if there are no external clients and the cluster is idle, or even if the cluster is still computing?
- Currently, determining which clients are external clients (meaning not a `worker_client`) isn't possible; we don't have access to that information. This would require an upstream change to distributed.
- Alternatively, we could shut down if the connection from the `GatewayCluster` object is dropped (after a timeout). This could be handled without upstream changes, but I'm not sure if this is the right shutdown condition. A rough sketch of a disconnect-driven shutdown follows this list.
Reference issues:
- #155
- https://github.com/pangeo-data/pangeo-binder/pull/139#issuecomment-615248146
I'm wondering if this issue has persisted. Using distributed 2021.6.0 and dask-gateway 0.9.0 with the kubernetes backend (running with the 2021.6.0 daskhub helm chart), I'm getting this issue even when the notebook kernel is restarted NOT in the middle of a computation, i.e. if I just do:
from dask_gateway import GatewayCluster
from dask.distributed import Client
cluster = GatewayCluster()
And then restart my notebook kernel, the scheduler pod persists. If I scale up and then restart, the worker pods will persist as well. If I add a client before restarting the kernel, the closing of the client connection does register in the logs of the scheduler pod, but it does not trigger that pod to shut down.
Explicitly closing the cluster with `cluster.close()` triggers the scheduler to shut down normally.
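Until that's sorted out, the most reliable workaround is probably to make the explicit close part of the workflow, e.g. with try/finally. This is just a sketch, and note it still won't run if the kernel process is killed outright, which is exactly the gap discussed above.

```python
from dask.distributed import Client
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
try:
    client = Client(cluster)
    # ... do work ...
finally:
    # By default this shuts down the remote cluster, not just the local
    # connection to it.
    cluster.close()
```

I believe using the cluster object as a context manager amounts to the same thing, if you prefer that form.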
Hi everyone, I found this issue while searching for information about Gateway clusters not terminating automatically when the main Python process exits.
I'm in the same situation as @bolliger32 above with up-to-date versions of the libraries (dask/distributed 2022.7.0, dask-gateway 2022.6.1).
From what I read in the documentation, I was expecting dask clusters to be stopped when I stop my kernel or the Jupyter server I'm using, but this is not the case.
Should we open another issue?