
Submit SLURM job arrays instead of separate jobs?

Aman-A opened this issue · 8 comments

Hi, I'm not sure if this package is still being actively developed, but I was wondering if you could make the SlurmExecutor capable of submitting job arrays instead of individual jobs, or if you have advice on how I could go about implementing that feature. Basically, I'd want to use the map function to submit a job array of a designated size (e.g. --array=1-100) and then map the separate function executions to those workers. The simplest case, where the array size equals the number of function evaluations, would be sufficient for my applications: for a loop over an array of length 100, it would always submit a job array of 100 individual jobs, and each $SLURM_ARRAY_TASK_ID would be mapped to a single value from the array.
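The one-task-per-array-index mapping described above can be sketched in a few lines. This is a hypothetical helper (`array_task_input` is not part of clusterfutures), showing only how a 1-based `$SLURM_ARRAY_TASK_ID` would select one element of the input list inside each array job:

```python
import os

def array_task_input(inputs, task_id=None):
    """Map a 1-based $SLURM_ARRAY_TASK_ID to one element of `inputs`.

    Submitted with --array=1-len(inputs), each array task would read its
    own task ID from the environment and compute exactly one input.
    """
    if task_id is None:
        task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    # SLURM array indices in this scheme start at 1, Python lists at 0.
    return inputs[task_id - 1]
```

Each worker would then call the target function on `array_task_input(inputs)` and pickle its result to a task-specific output file.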

Anyway, thanks for developing this!

Aman-A · Nov 01 '20

Sounds cool! Happy to merge a PR, but as I no longer have access to a Slurm cluster, I can't implement it myself. (And I'm afraid I don't remember enough about Slurm to have recommendations for how to implement it… but yeah, I like the idea of submitting jobs in "chunks" like you suggest.)

sampsyo · Nov 01 '20

I think I got it to work for functions that take a single argument, which is all I need for now, but I'll have to spend more time to make it compatible with functions that take variable positional/keyword arguments (i.e. using *args, **kwargs in the SlurmExecutor.submit_array function I wrote): https://github.com/Aman-A/clusterfutures/commit/19ebb9958e6e17636448efeb974b66e57dcf552c#diff-8ec507c811d69d6164ca569b3458debd21770c0b0a0dc6f29c9bf5f69d14fde6

Aman-A · Nov 02 '20

Neato. Nice work! Feel free to put together a PR if you ever feel like it.

sampsyo · Nov 03 '20

Thanks! I just added another feature that allows batching the function evaluations within each job, so that, e.g., you can map N inputs to M jobs, each of which serially computes N/M of the function evaluations. The cluster IT people at my university want us to submit jobs that take at least 10 minutes each, so this is useful for avoiding a ton of short jobs.
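The N-inputs-to-M-jobs split can be sketched as a plain chunking function. This is an illustrative helper, not the actual implementation in the slurm_array branch; it just shows one way to divide inputs into at most M contiguous, near-equal batches:

```python
def batch_inputs(inputs, n_jobs):
    """Split `inputs` into at most `n_jobs` contiguous batches of
    near-equal size; each batch would be computed serially in one job."""
    base, extra = divmod(len(inputs), n_jobs)
    batches, start = [], 0
    for i in range(n_jobs):
        # The first `extra` batches take one additional item.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        batches.append(inputs[start:start + size])
        start += size
    return batches
```

With 100 inputs and 10 jobs, each batch holds 10 inputs; uneven splits differ by at most one item per batch.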

It seems to work so far for my applications, but getting it to work was a little hacky, so I'd hesitate to push it to the master branch in case it doesn't work for others. Maybe you could create a branch and I could open a PR against that? My changes were all done in the slurm_array branch. Otherwise, I'm happy to do it on master, whichever you prefer.

Aman-A · Nov 03 '20

Sure thing! Just made a batch branch: https://github.com/sampsyo/clusterfutures/tree/batch

sampsyo · Nov 03 '20

For your info:

We are planning to implement something slightly related: a new SlurmExecutor.map() method that executes N tasks by submitting only M<=N SLURM jobs. For instance, we may want to execute 100 tasks by batching them into 20 SLURM jobs of 5 tasks each (this should be doable with multiple srun commands and the appropriate SLURM options like ntasks). The main reason is to avoid hitting job-submission limits in some of our use cases, where we'd like to execute ~400 tasks but are not allowed to submit 400 SLURM jobs at the same time. We are still at a very preliminary stage - see https://github.com/fractal-analytics-platform/fractal-server/issues/493.

I guess that we'll end up with a very custom implementation, and probably it won't make sense to contribute it back to clusterfutures - but I'd like to leave a trace here about this related (planned) work.

tcompa · Mar 16 '23

Thanks @tcompa, I'd be interested to see what you come up with. I've been idly thinking about some way of doing multiple tasks per Slurm job, to allow for finer-grained tasks.

I think there is value in clusterfutures being really simple, though, with the straightforward mapping of one task to one job. Multiple tasks per job quickly raises questions like: if a job times out, do all its pending tasks fail, or do we try to schedule them in another job? My gut feeling is that it makes sense to keep clusterfutures simple and be clear about when you should and shouldn't use it. :shrug:

takluyver · Mar 20 '23

I fully agree about keeping clusterfutures simple (we are working on this feature due to a specific constraint), and this likely leads to the one-task-one-job approach.

> If a job times out, do all its pending tasks fail? Or do we try to schedule them in another job?

Granted, it's not easy. In our project we can be quite strict (if N tasks are being executed in parallel and one of them fails, we already know that all the others are to be considered invalid), which makes things easier.


FYI, to give a glimpse of what we are working on:

1. On the SLURM side, we are currently exploring submission scripts built from multiple blocks of srun commands, for instance:
#!/bin/bash

#SBATCH --partition=normal

# Execute job steps / first sub-batch
srun --ntasks=1 --nodes=1 --cpus-per-task=$SLURM_CPUS_PER_TASK bash -c "sleep 10; echo 'hello 1'" &
srun --ntasks=1 --nodes=1 --cpus-per-task=$SLURM_CPUS_PER_TASK bash -c "sleep 20; echo 'hello 2'" &
wait

# Execute job steps / second sub-batch
srun --ntasks=1 --nodes=1 --cpus-per-task=$SLURM_CPUS_PER_TASK bash -c "sleep 10; echo 'hello 1'" &
srun --ntasks=1 --nodes=1 --cpus-per-task=$SLURM_CPUS_PER_TASK bash -c "sleep 20; echo 'hello 2'" &
wait

where the increased complexity (with a three-fold granularity in how to combine tasks into scripts, while also capping the maximum number of parallel executions inside a SLURM job) comes from external constraints we have.
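A script with the shape shown above could be generated programmatically. The following is a minimal sketch under stated assumptions (`make_sbatch_script`, its parameters, and the fixed srun flags are all hypothetical, not fractal-server or clusterfutures code): it emits sub-batches of backgrounded srun job steps separated by `wait`, capping how many run concurrently inside the job:

```python
def make_sbatch_script(commands, max_parallel, partition="normal"):
    """Generate an sbatch script running `commands` as srun job steps,
    at most `max_parallel` at a time (sub-batches separated by `wait`)."""
    lines = ["#!/bin/bash", "", f"#SBATCH --partition={partition}", ""]
    for start in range(0, len(commands), max_parallel):
        lines.append(f"# Execute job steps / sub-batch starting at {start}")
        for cmd in commands[start:start + max_parallel]:
            # Each step runs in the background; `wait` joins the sub-batch.
            lines.append(
                "srun --ntasks=1 --nodes=1 "
                "--cpus-per-task=$SLURM_CPUS_PER_TASK "
                f"bash -c {cmd!r} &"
            )
        lines.append("wait")
        lines.append("")
    return "\n".join(lines)
```

Generating the script from a task list keeps the batching policy (how many steps per sub-batch, how many sub-batches per job) in one place instead of hand-written scripts.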

2. On the clusterfutures side, we are planning to extend our custom version of SlurmExecutor with additional methods that mostly repeat a few basic operations (checking whether an output pickle file exists, checking whether execution was successful or raised an exception) N times, with N being the number of tasks combined in a single submission script. We are not planning to address re-scheduling.
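The "repeat the basic operations N times" idea can be sketched as a polling loop over the expected output pickle files. All names here are hypothetical and the error handling is deliberately naive (real clusterfutures also has to surface exceptions stored inside the pickles and uses its own serialization helpers):

```python
import os
import pickle
import time

def wait_for_outputs(paths, timeout=60.0, poll=1.0):
    """Wait until every output pickle in `paths` exists, then load them
    in order; raise TimeoutError if any is still missing at `timeout`."""
    deadline = time.monotonic() + timeout
    pending = set(paths)
    while pending:
        # Drop the paths whose output files have appeared.
        pending = {p for p in pending if not os.path.exists(p)}
        if not pending:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"missing outputs: {sorted(pending)}")
        time.sleep(poll)
    results = []
    for p in paths:
        with open(p, "rb") as f:
            results.append(pickle.load(f))
    return results
```

A per-job wrapper would call something like this once per batch of N tasks, then decide whether the whole batch succeeded.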

tcompa · Mar 20 '23