dask-gateway
question: granular cluster resource limits based on user / Jupyter RBAC
Hi,
dask-gateway allows setting resource limits on the clusters that users can create (https://gateway.dask.org/resource-limits.html), but those limits are global for all users. In a multi-tenant deployment it is often the case that there are various user groups to which different limits should apply.
JupyterLab 2.0 introduces some notion of RBAC-based authentication, and I was wondering whether that could be used to set more granular limits in dask-gateway?
perhaps @consideRatio has thoughts here as well
Thanks for opening this issue.
The resource-limit settings in dask-gateway are actually per-cluster (e.g. max cores per cluster). Setting them as described in that doc sets global defaults, but those can be overridden per user via an options handler. The handler takes any options specified by the user when creating the cluster, along with the user object itself (see here), and returns a new set of options to apply to that cluster. This flexibility allows for defining whatever per-user/group rules you want, without having to bake those into dask-gateway itself. For example:
```python
from dask_gateway_server.options import Options

def options_handler(options, user):
    # Users in the `power-users` group get bigger clusters with higher limits
    if "power-users" in user.groups:
        return {
            "worker_cores": 8,
            "worker_memory": "16 G",
            "cluster_max_workers": 100,
        }
    else:
        return {
            "worker_cores": 4,
            "worker_memory": "8 G",
            "cluster_max_workers": 10,
        }

c.Backend.cluster_options = Options(handler=options_handler)
```
Note that when authenticating with JupyterHub, the user's `.groups` field mirrors that of JupyterHub.
Excellent, this is exactly what we need! Is there an option to set a max cluster lifetime after which the cluster will be culled?
There's idle_timeout (https://gateway.dask.org/api-server.html#c.ClusterConfig.idle_timeout), a maximum time a cluster can sit idle (unused) before it's culled, but there isn't a total maximum runtime for the cluster itself. That wouldn't be too tricky to add, though, if it'd be useful for you. File an issue if so.
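For reference, the idle timeout goes in the gateway config file alongside the options handler above; the one-hour value here is just an example:

```python
# Cull clusters after one hour of inactivity (value is in seconds)
c.ClusterConfig.idle_timeout = 3600
```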