dask-gateway
Optionally set jupyterhub pod as DaskCluster owner?
Over in https://github.com/pangeo-data/pangeo-binder/issues/143, we've noticed that some dask-gateway pods hang around. The root cause is likely something like https://github.com/ipython/ipykernel/issues/462. But even if that's fixed, there might be cases like a segfault in the client where we can't hope to inform other pods to close.
When deployed with jupyterhub, we might have a second option for cleaning up these pods. We could add the jupyterhub user pod as an owner of the `daskcluster` thing (pod? app?) created when a user does `gateway.new_cluster()`. Then when the jupyterhub user pod goes away, the dask-gateway resources are cleaned up too.
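For background, the cleanup would hinge on Kubernetes garbage collection: if the `daskcluster` object carries an `ownerReferences` entry pointing at the user pod, deleting the pod deletes the cluster object too. A minimal sketch with the kubernetes Python client, assuming the CRD is `daskclusters` in the `gateway.dask.org/v1alpha1` group (names and uids below are made up):

```python
# Sketch only: attach the JupyterHub user pod as owner of a DaskCluster object.
# Group/version/plural are assumptions about the CRD and may not match the
# actual schema; owner references also only work within a single namespace.
from kubernetes import client, config

config.load_incluster_config()
api = client.CustomObjectsApi()

# The JupyterHub user pod (the intended owner).
owner_ref = {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "jupyter-someuser",   # hypothetical pod name
    "uid": "1234-abcd-5678",      # the pod's actual UID
}

# Patch the DaskCluster custom object so Kubernetes garbage collection
# removes it when the owning pod is deleted.
api.patch_namespaced_custom_object(
    group="gateway.dask.org",      # assumed CRD group
    version="v1alpha1",            # assumed CRD version
    namespace="pangeo",
    plural="daskclusters",
    name="dask-gateway-abc123",    # hypothetical cluster object name
    body={"metadata": {"ownerReferences": [owner_ref]}},
)
```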
Does that seem OK to you (perhaps as an option exposed in the helm chart)? If so I'd love to work on it. It'd be a good way to dive into the internals a bit.
This is possible, but tricky. We'd need to do the following:
- Use the downward API to make the JupyterHub pod aware of its own name and uid.
- Send those along with the user's request, as part of the cluster options. We could make this automatic with the existing templated client-side defaults, but currently those options would show up to the user in the widget. We might add support for *hidden* options (things that shouldn't show up in the widget) to make this nicer (a rough sketch of the client side follows this list).
- The api server would then need to set the `ownerReferences` to give the jupyterhub pod ownership of the `daskcluster` object. This isn't currently possible with configuration, so something would have to be added here.
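To make the first two steps concrete, here's a rough client-side sketch. The environment variable names and the `owner_name` / `owner_uid` options are invented for illustration (they'd have to be defined in the gateway's option handler, ideally as hidden options), and the downward API env vars would have to be added to the singleuser pod spec:

```python
# Hypothetical flow: the singleuser pod learns its own name/uid from
# downward-API environment variables (set in the pod spec; names here are
# assumptions), then forwards them as cluster options so the api server
# could attach the ownerReference.
import os
from dask_gateway import Gateway

gateway = Gateway()

options = gateway.cluster_options()
# "owner_name" / "owner_uid" are hypothetical options; today they would
# also appear in the options widget, hence the idea of hidden options.
options.owner_name = os.environ.get("MY_POD_NAME", "")
options.owner_uid = os.environ.get("MY_POD_UID", "")

cluster = gateway.new_cluster(options)
```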
In general this all seems doable and possibly in-scope, but it feels a bit messy. I think #255 provides a better general solution. However, having pods linger around isn't great - would you be free to chat at some point to help debug what's going on? Most graceful uses should clean up nicely.
Oh https://github.com/dask/dask-gateway/issues/255 does sound better... Probably just close this issue then?
> would you be free to chat at some point to help debug what's going on? Most graceful uses should clean up nicely.
Happy to chat, but probably wouldn't be helpful at this point. The lingering resources are on our binder, so we don't really have any idea what users / actions are causing issues. But everything is consistent with "non-graceful exit" so that's where I'm focusing.
But I completely missed the `idle_timeout` option! I'm going to start with that.
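For anyone else landing here, the knob I mean is the cluster idle timeout in the gateway server config; going from memory, it's something like the following (the exact traitlet name and where the helm chart exposes it are worth double-checking against the docs):

```python
# dask_gateway_config.py (or the equivalent extraConfig block in the helm chart).
# Shut down clusters that have been idle for an hour; name/location of this
# traitlet is from memory, so treat it as an assumption.
c.ClusterConfig.idle_timeout = 3600  # seconds; 0 disables the timeout
```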
This may be something we'd still like to support (I need to sleep on this for a bit), so if it's fine by you I'd like to leave this open for a bit. I think #255 does a better job of solving the general issue, but this may be useful as well.