partial results when SLURM cancels jobs due to time limit
Hi @mschubert thanks for maintaining clustermq, which seems interesting!
I was wondering if it is possible to retrieve partial results from SLURM even if a job times out?
For example, consider this code:
myfun <- function(x){
Sys.sleep(10)
x*2
}
minutes.per.job <- 1
result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)
with this SLURM template
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ megabytes | 200 }}
#SBATCH --time={{ minutes | 30 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
I get output like this:
> result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)
Submitting 2 worker jobs to SLURM as 'cmq8454' ...
Running 20 calculations (5 objs/20.2 Kb common; 1 calls/chunk) ...
[=====================================>----------------] 70% (2/2 wrk) eta: 32s
Error: Socket timed out after 60053 ms
Master: [2.3 mins 0.0% CPU]; Worker: [avg 2.1% CPU, max 227.8 Mb]
> result.list
Error: object 'result.list' not found
I set a time limit of 1 minute per job, with 2 jobs, but each call takes 10 seconds, so with 20 calls split over 2 workers there is not enough time to finish them all. I get an error from Q(), but I wonder if there is a way to get a list of results for the calls which could be computed under the time limit (and NULL for the ones which timed out)?
Thanks!!
This is currently not supported; you need to request enough resources from the scheduler to complete your computations.
Do you have a use case for when such functionality would be required?
Thanks for your response, that is quite helpful! Yes, I have a use case: large machine learning experiments with many combinations of algorithms and data sets, where we don't really know in advance how long each combination will take. Initially we can submit them all with a time limit of, say, 24 hours. Many jobs will finish under that limit, and we can hopefully save and analyze those results even if we don't get the results of the longer jobs. For the ones that go over the time limit, we would just resubmit the remaining jobs, or try increasing the time limit.
Ok, I see.
It seems you have the following options:
- Request more time overall. This is the usual strategy when requesting HPC resources: request enough time that your jobs are sure to finish, because the cost is asymmetric (reserving too much is usually very cheap compared to a job being cancelled)
- Call Q multiple times, each with a subset of your data. In this case, you wouldn't need to check which jobs didn't finish; all would be submitted and return within the time limit
- Use the max_calls_worker argument and submit more jobs. This way, you can limit each job to only work on n calls, so each worker stays under the time limit (a rough sketch of these last two options is below)
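For illustration only (untested; the chunk size of 5, n_jobs=4, and max_calls_worker=5 are just assumptions based on your 10-second calls and 1-minute limit, not recommended values), the last two options could look roughly like this with your function and template values:
# same function and template values as in your example
myfun <- function(x){
Sys.sleep(10)
x*2
}
minutes.per.job <- 1
# Option: call Q() once per chunk, keeping NULL for chunks that time out
chunks <- split(1:20, ceiling(seq_along(1:20) / 5))  # 4 chunks of 5 calls each
chunk.results <- lapply(chunks, function(xs) {
  tryCatch(
    clustermq::Q(myfun, x=xs, n_jobs=2,
                 template=list(minutes=minutes.per.job, megabytes=3000),
                 timeout=minutes.per.job*60),
    error = function(e) NULL  # a timed-out chunk becomes NULL instead of aborting the loop
  )
})
# Option: one Q() call, but cap the calls per worker so each job stays under its limit
# (4 jobs x 5 calls x 10 s means ~50 s of work per job)
result.list <- clustermq::Q(myfun, x=1:20, n_jobs=4, max_calls_worker=5,
                            template=list(minutes=minutes.per.job, megabytes=3000))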
We could also consider making fail_on_error = FALSE return NULL for timed-out jobs.
"fail_on_error = FALSE return NULL for timed-out jobs." -> this would work.