dask-jobqueue
Add support for dask-ctl
As mentioned in #543 it would be really nice for dask-jobqueue to support dask-ctl for convenient cluster management. However, from what I understand about HPC scheduling systems this may not be a trivial task.
Dask Control aims to allow users to create/list/scale/delete Dask clusters via the CLI and a Python API. Support for dask-ctl is implemented on a per-cluster-manager basis with the following tasks (a rough sketch of the corresponding hooks follows the list):
- It must be possible to delete a Cluster object without destroying the Dask cluster
- It must be possible to list all running clusters
- It must be possible to create a new instance of the Cluster object that represents an existing cluster
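From what I understand of dask-ctl, a cluster manager provides a `from_name` classmethod for reconstruction and registers a discovery function via an entry point. A rough sketch of what those hooks might look like for this project (the names, signatures and entry point group here are my reading of dask-ctl, not an existing implementation):

```python
# Hypothetical sketch of the hooks dask-ctl expects a cluster manager to provide.
# Nothing below exists in dask-jobqueue today; names are illustrative only.
from typing import AsyncIterator, Callable, Tuple


class SLURMCluster:  # stand-in for dask_jobqueue.SLURMCluster
    @classmethod
    def from_name(cls, name: str) -> "SLURMCluster":
        """Reconstruct a Cluster object that represents an already-running cluster."""
        # 1. Look up the scheduler job and stored config for `name`
        # 2. Connect an RPC to the existing scheduler instead of starting a new one
        # 3. Return a Cluster object that can list/scale/delete workers
        raise NotImplementedError


async def discover() -> AsyncIterator[Tuple[str, Callable]]:
    """Yield (cluster name, cluster manager class) pairs for running clusters.

    Would be registered under dask-ctl's cluster discovery entry point group
    so that listing tools can find clusters created by this project.
    """
    for name in []:  # placeholder: e.g. parse `squeue` output for scheduler jobs
        yield name, SLURMCluster
```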
The main challenges here are around moving the state out of the Cluster object to a place where it can be retrieved later. On platforms like Kubernetes or the Cloud much of the state can be serialised into tags/labels on the various resources, but I'm not sure how many HPC systems support this kind of metadata storage.
The other challenge is how to discover clusters. On Kubernetes, for example, we set a tag on all resources that marks them as being created by dask-ctl and stores an ID that can be used to retrieve the metadata. Again, I'm not sure how flexible HPC schedulers are at being able to tag/label jobs with arbitrary metadata.
The last thing that may be a blocker is that the Dask cluster must always run the scheduler remotely; it cannot be within the local (or login-node) Python process. I'm not sure how that affects things here.
I'm keen to see this happen, and if folks have thoughts on how it could be implemented I'd love to hear them.
Apologies in advance for naive questions: is a `ClusterManager` a dask-ctl concept or a more general dask concept?
On every HPC system I've used the head node and compute nodes have access to at least one shared filesystem, my brain naively jumped to storing the state of each cluster in text files there. Can you see obvious reasons this wouldn't work?
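To make that concrete, a minimal sketch of what state-in-a-shared-directory could look like (the directory layout and field names are made up for illustration, and it assumes `$HOME` is on the shared filesystem):

```python
# Illustrative sketch: persist cluster state as JSON on a shared filesystem.
# The directory and field names are hypothetical, not an agreed convention.
import json
from pathlib import Path

STATE_DIR = Path.home() / ".dask" / "clusters"  # assumes $HOME is shared


def save_cluster_state(name: str, scheduler_address: str, job_script: str, config: dict) -> None:
    """Write everything needed to rebuild a Cluster object later."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    state = {
        "name": name,
        "scheduler_address": scheduler_address,
        "job_script": job_script,  # needed to submit more worker jobs when scaling up
        "config": config,          # kwargs originally used to create the cluster
    }
    (STATE_DIR / f"{name}.json").write_text(json.dumps(state))


def load_cluster_state(name: str) -> dict:
    """Read the stored state back, e.g. from a from_name() classmethod."""
    return json.loads((STATE_DIR / f"{name}.json").read_text())
```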
One potential thing we would need to deal with is that cluster admins sometimes kill long-running jobs on head nodes (cf. #471).
Yeah, `distributed.deploy.Cluster` is an interface that we use all over Dask. The `SLURMCluster`, `PBSCluster`, etc. classes in this project subclass `Cluster`. We also use it for Kubernetes, Cloud, Hadoop, Local, etc. so that users can move between backends with some consistency.
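For example, the jobqueue classes are subclasses of that shared interface:

```python
from distributed.deploy import Cluster
from dask_jobqueue import PBSCluster, SLURMCluster

# Both jobqueue cluster managers subclass the generic Cluster interface
assert issubclass(SLURMCluster, Cluster)
assert issubclass(PBSCluster, Cluster)
```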
Please ask as many questions as you like, it helps identify things that are undocumented 😂. I'm happy to answer whatever questions you have.
I think we would need to think about moving the scheduler off the head node and into a job in the cluster, but I might be wrong here.
A shared filesystem feels like a reasonably safe assumption. It's just the inconsistency of the implementation that would worry me.
Let's use SLURM as an example. Cluster discovery and reconstruction would look like this:
- We could in theory list all running jobs on a SLURM cluster, looking for jobs running the `dask-scheduler` command (see the sketch after this list).
- We would need to get the ID of each scheduler somehow.
- We would need to find all the worker jobs associated with each scheduler.
  - How would we find these? Perhaps in the metadata file?
- We would need to find the config used to create those worker jobs (in order to scale up and create more).
  - How would we consistently find this file across systems? While a shared filesystem likely exists, we won't know where it is mounted.
- Connect an RPC to a scheduler.
- Reconstruct a `SLURMCluster` object around that config and RPC.
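A rough sketch of the listing step, assuming `squeue`'s `%i`/`%j`/`%o` format fields (job id, job name, command) behave as documented. Note that for batch jobs the reported command may be the submitted script rather than the `dask-scheduler` invocation inside it, so in practice the job name may end up being the more reliable marker:

```python
# Illustrative only: list this user's SLURM jobs whose command mentions dask-scheduler.
import getpass
import subprocess


def find_scheduler_jobs():
    out = subprocess.check_output(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%j|%o"],
        text=True,
    )
    jobs = []
    for line in out.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        job_id, job_name, command = parts
        if "dask-scheduler" in command or "dask scheduler" in command:
            jobs.append({"job_id": job_id, "name": job_name, "command": command})
    return jobs
```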
Gotcha, appreciate the explanation! My confusion was about whether `Cluster` and `ClusterManager` were actually the same thing, which is now clear 🙂
> We could in theory list all running jobs on a SLURM cluster looking for jobs running the `dask-scheduler` command. We would need to get the ID of each scheduler somehow.
This one could maybe be automatically encoded in the job name?
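Something like this could work, assuming we pick a naming convention such as `dask-<id>-scheduler` (entirely hypothetical):

```python
# Hypothetical naming convention: dask-<cluster id>-scheduler / dask-<cluster id>-worker
import re
import uuid


def make_scheduler_job_name() -> str:
    """Generate a job name that embeds a cluster ID we can recover later."""
    cluster_id = uuid.uuid4().hex[:8]
    return f"dask-{cluster_id}-scheduler"


def parse_cluster_id(job_name: str):
    """Return the embedded cluster ID, or None if the job name doesn't match."""
    match = re.fullmatch(r"dask-([0-9a-f]+)-scheduler", job_name)
    return match.group(1) if match else None
```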
> We would need to find all the worker jobs associated with each scheduler. How would we find these? Perhaps in the metadata file?
If we find a way to hook into the filesystem, could we get away with something as simple as a text file per scheduler, created/destroyed on job start/end? I'm not sure if machinery for this already exists, but it could be added (sketched below).
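One way to get create/destroy hooks without new machinery might be a scheduler preload script (`dask-scheduler --preload register.py`), which Distributed already supports via `dask_setup`/`dask_teardown`. The file location and environment variable below are hypothetical:

```python
# register.py -- hypothetical preload passed as `dask-scheduler --preload register.py`.
# Writes a per-scheduler metadata file on startup and removes it on shutdown.
import json
import os
from pathlib import Path

STATE_DIR = Path.home() / ".dask" / "clusters"  # assumes a shared $HOME


def dask_setup(scheduler):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    name = os.environ.get("DASK_CTL_CLUSTER_ID", "unknown")  # hypothetical variable
    (STATE_DIR / f"{name}.json").write_text(
        json.dumps({"name": name, "scheduler_address": scheduler.address})
    )


def dask_teardown(scheduler):
    name = os.environ.get("DASK_CTL_CLUSTER_ID", "unknown")
    (STATE_DIR / f"{name}.json").unlink(missing_ok=True)
```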
> We would need to find the config used to create those worker jobs (in order to scale up and create more). How would we consistently find this file across systems? While a shared filesystem likely exists we won't know where it is mounted.
This is tougher for sure - I'll have a think 🙂
Thanks for step-by-stepping it, super useful!
It is typical on HPC to use the `--scheduler-file` flag when starting a cluster; I wonder if we could repurpose this somehow?
We would have to check if all HPC schedulers support this but when we list the running jobs it is likely we would have the command that was invoked to start the scheduler. So if things like the ID and config path were passed to the scheduler as arguments we could parse those out again. Again that might be a big assumption though.
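For example, if the scheduler job's command line is recoverable, the flags could be parsed back out of it (a sketch; it assumes the flags were passed explicitly rather than via config):

```python
# Illustrative: recover --scheduler-file from a command string pulled out of the
# scheduler job's metadata. Other flags (e.g. a hypothetical ID) could be handled the same way.
import shlex


def parse_scheduler_args(command: str) -> dict:
    tokens = shlex.split(command)
    args = {}
    for i, tok in enumerate(tokens):
        if tok.startswith("--scheduler-file="):
            args["scheduler_file"] = tok.split("=", 1)[1]
        elif tok == "--scheduler-file" and i + 1 < len(tokens):
            args["scheduler_file"] = tokens[i + 1]
    return args


# e.g. parse_scheduler_args("dask-scheduler --scheduler-file /shared/scheduler.json")
# -> {"scheduler_file": "/shared/scheduler.json"}
```

The scheduler file itself is JSON containing the scheduler's address, so once located it also gives us what we need to connect an RPC.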
My knowledge of jobqueue systems is pretty limited, but I wonder if it would be possible to use environment variables for this? This is a standard OS feature so it is very likely that every single cluster implementation supports it (though the mechanism to export the variables to the job might vary).
It should also be possible to retrieve the values of the exported variables from the jobs.
The only issue I can imagine would be size limits on the environment variables, but it should be possible to work around those (and ID and config path do not sound like a lot of characters).
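A sketch of that approach using `sbatch --export` (the variable names are hypothetical, and how easily the values can be read back from outside the job likely varies by scheduler):

```python
# Illustrative: pass the cluster ID and config path into the job via environment
# variables, and read them back inside the job. Variable names are hypothetical.
import os
import subprocess


def submit_scheduler_job(script_path: str, cluster_id: str, config_path: str) -> None:
    """Submit the scheduler job script with extra environment variables exported."""
    subprocess.run(
        [
            "sbatch",
            f"--export=ALL,DASK_CTL_CLUSTER_ID={cluster_id},DASK_CTL_CONFIG={config_path}",
            script_path,
        ],
        check=True,
    )


def read_cluster_metadata() -> dict:
    """Inside the job (e.g. in a scheduler preload), the values are plain lookups."""
    return {
        "cluster_id": os.environ.get("DASK_CTL_CLUSTER_ID"),
        "config_path": os.environ.get("DASK_CTL_CONFIG"),
    }
```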