dask-jobqueue
Add support for dask-ctl
As mentioned in #543 it would be really nice for dask-jobqueue to support dask-ctl for convenient cluster management. However, from what I understand about HPC scheduling systems this may not be a trivial task.
Dask Control aims to allow users to create/list/scale/delete Dask clusters via the CLI and a Python API. Support for dask-ctl is implemented on a per-cluster-manager basis with the following tasks (a rough sketch of the corresponding hooks follows the list):
- It must be possible to delete a Cluster object without destroying the Dask cluster
- It must be possible to list all running clusters
- It must be possible to create a new instance of the Cluster object that represents an existing cluster
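From what I understand of dask-ctl, a cluster manager provides a `from_name` classmethod for reconstruction and registers a discovery function via an entry point. A rough sketch of what those hooks might look like for this project (the names, signatures and entry point group here are my reading of dask-ctl, not an existing implementation):

```python
# Hypothetical sketch of the hooks dask-ctl expects a cluster manager to provide.
# Nothing below exists in dask-jobqueue today; names are illustrative only.
from typing import AsyncIterator, Callable, Tuple


class SLURMCluster:  # stand-in for dask_jobqueue.SLURMCluster
    @classmethod
    def from_name(cls, name: str) -> "SLURMCluster":
        """Reconstruct a Cluster object that represents an already-running cluster."""
        # 1. Look up the scheduler job and stored config for `name`
        # 2. Connect an RPC to the existing scheduler instead of starting a new one
        # 3. Return a Cluster object that can list/scale/delete workers
        raise NotImplementedError


async def discover() -> AsyncIterator[Tuple[str, Callable]]:
    """Yield (cluster name, cluster manager class) pairs for running clusters.

    Would be registered under dask-ctl's cluster discovery entry point group
    so that listing tools can find clusters created by this project.
    """
    for name in []:  # placeholder: e.g. parse `squeue` output for scheduler jobs
        yield name, SLURMCluster
```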
The main challenges here are around moving the state out of the Cluster object to a place where it can be retrieved later. On platforms like Kubernetes or the Cloud much of the state can be serialised into tags/labels on the various resources, but I'm not sure how many HPC systems support this kind of metadata storage.
The other challenge is how to discover clusters. On Kubernetes, for example, we set a tag on all resources that marks them as being created by dask-ctl and stores an ID that can be used to retrieve the metadata. Again, I'm not sure how flexible HPC schedulers are at being able to tag/label jobs with arbitrary metadata.
The last thing that may be a blocker is that the Dask cluster must always run the scheduler remotely; it cannot be within the local (or login-node) Python process. I'm not sure how that affects things here.
I'm keen to see this happen, and if folks have thoughts on how it could be implemented I'd love to hear them.
Apologies in advance for naive questions: is a `ClusterManager` a dask-ctl concept or a more general dask concept?
On every HPC system I've used the head node and compute nodes have access to at least one shared filesystem, my brain naively jumped to storing the state of each cluster in text files there. Can you see obvious reasons this wouldn't work?
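To make that concrete, a minimal sketch of what state-in-a-shared-directory could look like (the directory layout and field names are made up for illustration, and it assumes `$HOME` is on the shared filesystem):

```python
# Illustrative sketch: persist cluster state as JSON on a shared filesystem.
# The directory and field names are hypothetical, not an agreed convention.
import json
from pathlib import Path

STATE_DIR = Path.home() / ".dask" / "clusters"  # assumes $HOME is shared


def save_cluster_state(name: str, scheduler_address: str, job_script: str, config: dict) -> None:
    """Write everything needed to rebuild a Cluster object later."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    state = {
        "name": name,
        "scheduler_address": scheduler_address,
        "job_script": job_script,  # needed to submit more worker jobs when scaling up
        "config": config,          # kwargs originally used to create the cluster
    }
    (STATE_DIR / f"{name}.json").write_text(json.dumps(state))


def load_cluster_state(name: str) -> dict:
    """Read the stored state back, e.g. from a from_name() classmethod."""
    return json.loads((STATE_DIR / f"{name}.json").read_text())
```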
One potential thing we would need to deal with is that cluster admins sometimes kill long-running jobs on head nodes (cf. #471).
Yeah, `distributed.deploy.Cluster` is an interface that we use all over Dask. The `SLURMCluster`, `PBSCluster`, etc. classes in this project subclass `Cluster`. We also use it for Kubernetes, Cloud, Hadoop, Local, etc. so that users can move between backends with some consistency.
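For example, the jobqueue classes are subclasses of that shared interface:

```python
from distributed.deploy import Cluster
from dask_jobqueue import PBSCluster, SLURMCluster

# Both jobqueue cluster managers subclass the generic Cluster interface
assert issubclass(SLURMCluster, Cluster)
assert issubclass(PBSCluster, Cluster)
```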
Please ask as many questions as you like, it helps identify things that are undocumented 😂. I'm happy to answer whatever questions you have.
I think we would need to think about moving the scheduler off the head node and into a job in the cluster, but I might be wrong here.
A shared filesystem feels like a reasonably safe assumption. It's just the inconsistency of the implementation that would worry me.
Let's use SLURM as an example. Cluster discovery and reconstruction would look like this:
- We could in theory list all running jobs on a SLURM cluster, looking for jobs running the `dask-scheduler` command (see the sketch after this list).
- We would need to get the ID of each scheduler somehow.
- We would need to find all the worker jobs associated with each scheduler.
  - How would we find these? Perhaps in the metadata file?
- We would need to find the config used to create those worker jobs (in order to scale up and create more).
  - How would we consistently find this file across systems? While a shared filesystem likely exists, we won't know where it is mounted.
- Connect an RPC to a scheduler.
- Reconstruct a `SLURMCluster` object around that config and RPC.
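A rough sketch of the listing step, assuming `squeue`'s `%i`/`%j`/`%o` format fields (job id, job name, command) behave as documented. Note that for batch jobs the reported command may be the submitted script rather than the `dask-scheduler` invocation inside it, so in practice the job name may end up being the more reliable marker:

```python
# Illustrative only: list this user's SLURM jobs whose command mentions dask-scheduler.
import getpass
import subprocess


def find_scheduler_jobs():
    out = subprocess.check_output(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%j|%o"],
        text=True,
    )
    jobs = []
    for line in out.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        job_id, job_name, command = parts
        if "dask-scheduler" in command or "dask scheduler" in command:
            jobs.append({"job_id": job_id, "name": job_name, "command": command})
    return jobs
```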
Gotcha, appreciate the explanation! My confusion was about whether `Cluster` and `ClusterManager` were actually the same thing, which is now clear 🙂
> We could in theory list all running jobs on a SLURM cluster looking for jobs running the `dask-scheduler` command. We would need to get the ID of each scheduler somehow.
This one could maybe be automatically encoded in the job name?
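Something like this could work, assuming we pick a naming convention such as `dask-<id>-scheduler` (entirely hypothetical):

```python
# Hypothetical naming convention: dask-<cluster id>-scheduler / dask-<cluster id>-worker
import re
import uuid


def make_scheduler_job_name() -> str:
    """Generate a job name that embeds a cluster ID we can recover later."""
    cluster_id = uuid.uuid4().hex[:8]
    return f"dask-{cluster_id}-scheduler"


def parse_cluster_id(job_name: str):
    """Return the embedded cluster ID, or None if the job name doesn't match."""
    match = re.fullmatch(r"dask-([0-9a-f]+)-scheduler", job_name)
    return match.group(1) if match else None
```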
> We would need to find all the worker jobs associated with each scheduler. How would we find these? Perhaps in the metadata file?
If we find a way to hook into the filesystem, could we get away with something as simple as a text file per scheduler, created/destroyed on job start/end? I'm not sure if machinery for this already exists, but it could be added (sketched below).
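One way to get create/destroy hooks without new machinery might be a scheduler preload script (`dask-scheduler --preload register.py`), which Distributed already supports via `dask_setup`/`dask_teardown`. The file location and environment variable below are hypothetical:

```python
# register.py -- hypothetical preload passed as `dask-scheduler --preload register.py`.
# Writes a per-scheduler metadata file on startup and removes it on shutdown.
import json
import os
from pathlib import Path

STATE_DIR = Path.home() / ".dask" / "clusters"  # assumes a shared $HOME


def dask_setup(scheduler):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    name = os.environ.get("DASK_CTL_CLUSTER_ID", "unknown")  # hypothetical variable
    (STATE_DIR / f"{name}.json").write_text(
        json.dumps({"name": name, "scheduler_address": scheduler.address})
    )


def dask_teardown(scheduler):
    name = os.environ.get("DASK_CTL_CLUSTER_ID", "unknown")
    (STATE_DIR / f"{name}.json").unlink(missing_ok=True)
```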
> We would need to find the config used to create those worker jobs (in order to scale up and create more). How would we consistently find this file across systems? While a shared filesystem likely exists we won't know where it is mounted.
This is tougher for sure - I'll have a think 🙂
Thanks for step-by-stepping it, super useful!
It is typical on HPC to use the `--scheduler-file` flag when starting a cluster; I wonder if we could repurpose this somehow?
We would have to check if all HPC schedulers support this but when we list the running jobs it is likely we would have the command that was invoked to start the scheduler. So if things like the ID and config path were passed to the scheduler as arguments we could parse those out again. Again that might be a big assumption though.
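For example, if the scheduler job's command line is recoverable, the flags could be parsed back out of it (a sketch; it assumes the flags were passed explicitly rather than via config):

```python
# Illustrative: recover --scheduler-file from a command string pulled out of the
# scheduler job's metadata. Other flags (e.g. a hypothetical ID) could be handled the same way.
import shlex


def parse_scheduler_args(command: str) -> dict:
    tokens = shlex.split(command)
    args = {}
    for i, tok in enumerate(tokens):
        if tok.startswith("--scheduler-file="):
            args["scheduler_file"] = tok.split("=", 1)[1]
        elif tok == "--scheduler-file" and i + 1 < len(tokens):
            args["scheduler_file"] = tokens[i + 1]
    return args


# e.g. parse_scheduler_args("dask-scheduler --scheduler-file /shared/scheduler.json")
# -> {"scheduler_file": "/shared/scheduler.json"}
```

The scheduler file itself is JSON containing the scheduler's address, so once located it also gives us what we need to connect an RPC.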
My knowledge of jobqueue systems is pretty limited, but I wonder if it would be possible to use environment variables for this? This is a standard OS feature so it is very likely that every single cluster implementation supports it (though the mechanism to export the variables to the job might vary).
It should also be possible to retrieve the values of the exported variables from the jobs.
The only issue I can imagine would be size limits on the environment variables, but it should be possible to work around those (and ID and config path do not sound like a lot of characters).
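A sketch of that approach using `sbatch --export` (the variable names are hypothetical, and how easily the values can be read back from outside the job likely varies by scheduler):

```python
# Illustrative: pass the cluster ID and config path into the job via environment
# variables, and read them back inside the job. Variable names are hypothetical.
import os
import subprocess


def submit_scheduler_job(script_path: str, cluster_id: str, config_path: str) -> None:
    """Submit the scheduler job script with extra environment variables exported."""
    subprocess.run(
        [
            "sbatch",
            f"--export=ALL,DASK_CTL_CLUSTER_ID={cluster_id},DASK_CTL_CONFIG={config_path}",
            script_path,
        ],
        check=True,
    )


def read_cluster_metadata() -> dict:
    """Inside the job (e.g. in a scheduler preload), the values are plain lookups."""
    return {
        "cluster_id": os.environ.get("DASK_CTL_CLUSTER_ID"),
        "config_path": os.environ.get("DASK_CTL_CONFIG"),
    }
```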