dask-jobqueue
Add RemoteSlurmJob to connect SLURMCluster to a remote Slurm cluster
Hello 👋 Thank you for considering this feature request :) I have been looking at dask-jobqueue (together with Prefect) to allocate resources on a Slurm cluster I have access to. dask-jobqueue seems to be exactly what we'd need for this, thank you for maintaining it 🙇
Context
When Slurm and the Python process (script or notebook) are not running on the same host, SLURMCluster is not able to spawn any jobs and errors out.
Slurm added a REST API: https://slurm.schedmd.com/rest_api.html
Feature
Could we add a RemoteSLURMCluster and a RemoteSLURMJob that largely extend SLURMCluster/SLURMJob and, instead of using subprocess.Popen, make an HTTP request?
```python
from contextlib import contextmanager

import requests
from dask_jobqueue import SLURMCluster
from dask_jobqueue.slurm import SLURMJob


class RemoteSLURMJob(SLURMJob):
    @contextmanager
    def job_file(self):
        # We don't need a script file, only the script itself; to avoid
        # altering `async def start(self)` we yield the script here.
        yield self.job_script()

    async def _submit_job(self, script):
        # The request body should be formatted according to:
        # https://slurm.schedmd.com/rest_api.html#slurmctldSubmitJob
        response = requests.post("slurm-url/jobs/submit", json={"script": script})
        return response.json()

    def _job_id_from_submit_output(self, out):
        # `out` is the parsed JSON response from `_submit_job`.
        # See https://slurm.schedmd.com/rest_api.html#v0.0.36_job_submission_response
        return out["job_id"]

    @classmethod
    def _close_job(cls, job_id):
        # See: https://slurm.schedmd.com/rest_api.html#slurmctldCancelJob
        requests.delete(f"slurm-url/job/{job_id}")


class RemoteSLURMCluster(SLURMCluster):
    job_cls = RemoteSLURMJob
```
As far as I can tell this should be a drop-in replacement.
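To illustrate the drop-in idea, usage might look roughly like this. This is a sketch only: the cluster class is the one proposed above, and the keyword arguments and values are illustrative, mirroring SLURMCluster.

```python
from dask.distributed import Client

# Sketch only: RemoteSLURMCluster is the class proposed above and is
# assumed to accept the same keyword arguments as SLURMCluster.
cluster = RemoteSLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=2)  # each job would be submitted through the Slurm REST API

client = Client(cluster)
# ... run Dask computations as usual ...

client.close()
cluster.close()
```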
Thoughts? (tagging @mrocklin @lesteve for visibility, hope you would have time for a review.)
This proposal sounds totally reasonable to me. Do you have any interest in raising a PR?
Very much! I'm only waiting on the infrastructure that I have access to to enable the SLURM REST API, so I can test it.
Would you have time for a review sometime this week/next week @jacobtomlinson?
Happy to review but I think it best for one of the core maintainers of this repo (@guillaumeeb, @lesteve) to do the final merge.
This also sounds totally reasonable to me, as it keeps the general concept of dask-jobqueue and, judging by your snippet, should be pretty readable!
It is really nice to have a REST API for submitting jobs; Slurm is definitely a nice job scheduler.
So waiting for your PR 👍 !
Just curious, your use case is to create a SlurmRemoteCluster on a host:
- where Slurm is not installed (for example sbatch does not exist) but you have access to the Slurm REST API endpoint
- which is in the same network as the Slurm cluster (I don't know the exact technical term, but what I mean is that the Dask scheduler on this host needs to be able to communicate with the Dask workers on your compute nodes)

Did I get this right?
If so, it does not feel like a very common situation, but I may be missing something of course ...
@lesteve thank you for responding 🙇‍♂️. I've been poking around dask-jobqueue for a few days now and am very happy to see it's been very well maintained. Thank you for this.
Just curious, your use case is to create a SlurmRemoteCluster on a host [...] Did I get this right?
Yes, that would be correct.
I wouldn't know how to determine whether this is a common situation. However, I would argue that since SLURM has introduced a REST API for starting jobs remotely, there must be some users.
The use case would be allowing a Docker container running inside an HPC cluster to call out to SLURM to schedule Dask workers with specific (GPU) resource requirements.
Inside the Docker container (sketched below):
- Start SlurmRemoteCluster scheduler
- HTTP call to SLURM REST API to start n workers with x resources
- n workers come online and register with scheduler
- workload processed
- HTTP call to SLURM REST API to close n workers
- Stop SlurmRemoteCluster scheduler
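A rough sketch of that workflow, assuming the RemoteSLURMCluster proposed above; the GPU directive and the `job_extra` keyword are illustrative (newer dask-jobqueue releases name it `job_extra_directives`):

```python
from dask.distributed import Client

# Inside the container: start the scheduler and ask Slurm (over its REST
# API, via the hypothetical RemoteSLURMCluster) for GPU workers.
cluster = RemoteSLURMCluster(
    cores=8,
    memory="32GB",
    job_extra=["--gres=gpu:1"],  # example GPU request per job
)

cluster.scale(jobs=4)      # HTTP calls submit 4 worker jobs
client = Client(cluster)   # workers come online and register with the scheduler

# ... workload processed ...

cluster.scale(jobs=0)      # HTTP calls cancel the worker jobs
client.close()
cluster.close()            # stop the scheduler
```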
OK running the Dask scheduler inside a docker container (on a login node I assume) is a use case that makes sense. I did not think of this, thanks!
For security reasons I would think that cluster sysadmins would not allow connecting to the Slurm REST API endpoint from the outside, but maybe these kinds of security constraints are only in place for "big" clusters.
I had in mind the ideal setup (unfortunately not easily possible, as far as I know ...) where your Dask scheduler lives outside of the cluster and the Dask workers live inside the cluster. See https://github.com/dask/dask-jobqueue/issues/471 for more details.
inside a docker container (on a login node I assume)
Yes - or at least somewhere it can send requests to and receive responses from the SLURM REST API. I would say "inside" the cluster.
Has there been any progress here?
inside a docker container (on a login node I assume)
This is exactly my use case (except with a Singularity container, but it still means I don't have the sbatch binary directly available).
This could be of interest as well: https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda
It SSHes back to the host system where srun etc. are available.
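Roughly, that pattern can be approximated by overriding the submit/cancel commands so they run on the host over SSH. This is a sketch under strong assumptions (placeholder hostname "login-node", passwordless SSH, and a job-script directory that is bind-mounted into the container so the script path is valid on the host), not the gist verbatim:

```python
from dask_jobqueue import SLURMCluster
from dask_jobqueue.slurm import SLURMJob


class SSHWrappedSLURMJob(SLURMJob):
    # Run sbatch/scancel on the host via SSH instead of inside the
    # container. "login-node" is a placeholder; the temporary job script
    # must live on a path shared between the container and the host.
    submit_command = "ssh login-node sbatch"
    cancel_command = "ssh login-node scancel"


class SSHWrappedSLURMCluster(SLURMCluster):
    job_cls = SSHWrappedSLURMJob
```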
Has there been any progress here?
Unfortunately the associated PR has gone stale... But there was some work on it, so if anyone wants to keep going, it would be nice!
+1 for this.
+1