dask-gateway
k8s-backed cluster remains active when client's kernel dies unintentionally
What happened:
I'm not sure if this is a bug report or a feature request, but the behavior is not what we want for our use case, and I can't find a way to change it. The issue occurs when the kernel for a notebook containing a client (linked to a `GatewayCluster`) dies unintentionally (e.g. due to a memory overload). We're using the Kubernetes backend within a `daskhub` helm chart configuration. Even if the cluster was created with `shutdown_on_close=True`, the scheduler does not shut down in these instances; instead it remains active until its `idle_timeout` setting is reached, and it can be reconnected to from a new notebook. This does not occur if the cluster is shut down explicitly, nor if the kernel is intentionally shut down (or restarted).
What you expected to happen:
The behavior I expect (and desire) is based on my experience with `dask-kubernetes`, where I was mainly using a local scheduler. In that case the scheduler was virtually always in the same kernel as the notebook, so when one died, the other died. I recognize the `dask-gateway` remote-scheduler situation may be harder to handle, and that in some cases you would want the scheduler to persist through an event like this. But I'm wondering if there could be a flag to indicate that a cluster should shut down even when a client connection is closed unintentionally.
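For contrast, here's a sketch of the explicit paths that do tear the scheduler down today (again with placeholder configuration):

```python
from dask_gateway import GatewayCluster

# Explicit shutdown works: the scheduler is removed promptly.
cluster = GatewayCluster(shutdown_on_close=True)
client = cluster.get_client()
cluster.shutdown()

# The context-manager form also works, but only on a clean exit:
with GatewayCluster(shutdown_on_close=True) as cluster:
    client = cluster.get_client()
    # ... do work ...
# close() runs here and shuts the cluster down; it never runs if the
# kernel dies without unwinding the Python process.
```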
Anything else we need to know?: If this is just a matter of implementation, I'm happy to take a crack at it if others have a sense of where to start. I did a bit of digging but couldn't pinpoint where this behavior might be altered.
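In the meantime, the only partial mitigation I can see is server-side: lowering the gateway's idle timeout so orphaned schedulers get reaped sooner. A sketch, assuming the daskhub chart's `gateway.extraConfig` hook and the `idle_timeout` setting from the dask-gateway config reference (the exact key and the 600-second value are illustrative):

```python
# dask-gateway server-side traitlets config, e.g. injected via the daskhub
# chart's gateway.extraConfig. Assumes idle_timeout is available on the
# Kubernetes backend's cluster config; 600 seconds is illustrative only.
c.KubeClusterConfig.idle_timeout = 600
```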
Environment:
- Dask version: 2021.4.1
- Python version: 3.8.8
- Operating System: Linux (`daskhub` helm chart defining a Kubernetes setup)