
partial results when SLURM cancels jobs due to time limit

Open tdhock opened this issue 7 months ago • 4 comments

Hi @mschubert thanks for maintaining clustermq, which seems interesting!

I was wondering if it is possible to retrieve partial results from SLURM even if a job times out?

For example, consider this code

myfun <- function(x){
  Sys.sleep(10)
  x*2
}
minutes.per.job <- 1
result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)

with this SLURM template

#!/bin/sh                                                                          
#SBATCH --job-name={{ job_name }}                                                  
#SBATCH --output={{ log_file | /dev/null }}                                        
#SBATCH --error={{ log_file | /dev/null }}                                         
#SBATCH --mem-per-cpu={{ megabytes | 200 }}                                        
#SBATCH --time={{ minutes | 30 }}                                                  
#SBATCH --array=1-{{ n_jobs }}                                                     
#SBATCH --cpus-per-task={{ cores | 1 }}                                            
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

I get output like this:

> result.list <- clustermq::Q(myfun, x=1:20, n_jobs=2, template=list(minutes=minutes.per.job, megabytes=3000), timeout=minutes.per.job*60)                         
 Submitting 2 worker jobs to SLURM as 'cmq8454' ...
 Running 20 calculations (5 objs/20.2 Kb common; 1 calls/chunk) ...
 [=====================================>----------------]  70% (2/2 wrk) eta: 32s
 Error: Socket timed out after 60053 ms
 Master: [2.3 mins 0.0% CPU]; Worker: [avg 2.1% CPU, max 227.8 Mb]
> result.list
 Error: object 'result.list' not found

I set a time limit of 1 minute per job, with 2 jobs, but each call takes 10 seconds, so there is not enough time for the 20 calls to finish. I get an error from Q(), but I wonder if there is a way to get a list of results for the calls that completed under the time limit (and NULL for the ones that timed out)?

Thanks!!

tdhock avatar May 30 '25 17:05 tdhock

This is currently not supported; you need to request enough resources from the scheduler to complete your computations.

Do you have a use case for when such functionality would be required?

mschubert avatar Jun 03 '25 20:06 mschubert

Thanks for your response, that is quite helpful! Yes, I have a use case: large machine learning experiments in which we have lots of combinations of algorithms and data sets, and we don't really know in advance how long each combination is going to take. Initially we can submit them all with a time limit of, say, 24 hours. Many jobs will finish under that limit, and we can hopefully save and analyze those results, even if we don't get the results of the longer jobs. For the ones that go over the time limit, we resubmit just the remaining jobs, possibly with an increased time limit.
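To make the resubmission workflow concrete, here is a rough sketch of what I have in mind (hypothetical code, not a clustermq feature): split the work into chunks, run one Q() call per chunk wrapped in tryCatch, and keep the indices of any chunk whose call errors out (e.g. due to a timeout) so those can be resubmitted later with a larger time limit.

```r
library(clustermq)

myfun <- function(x){
  Sys.sleep(10)
  x*2
}

x.all <- 1:20
# four chunks of five calls each (chunk size chosen for illustration)
chunk.list <- split(x.all, ceiling(seq_along(x.all)/5))

result.list <- list()
failed.chunks <- list()
for(chunk.name in names(chunk.list)){
  chunk <- chunk.list[[chunk.name]]
  res <- tryCatch(
    clustermq::Q(myfun, x=chunk, n_jobs=2,
                 template=list(minutes=1, megabytes=3000),
                 timeout=60),
    error=function(e) NULL)
  if(is.null(res)){
    # this chunk timed out (or otherwise failed); save its inputs
    # so it can be resubmitted later with a larger time limit
    failed.chunks[[chunk.name]] <- chunk
  } else {
    result.list[[chunk.name]] <- res
  }
}
```

This way a timeout only loses one chunk rather than the whole experiment.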

tdhock avatar Jun 03 '25 21:06 tdhock

Ok, I see.

It seems you have the following options:

  1. Request more time overall. This is the usual strategy when requesting HPC resources: request enough time that your jobs are sure to finish, because the cost is asymmetric (reserving too much is usually very cheap compared to having a job cancelled)

  2. Call Q multiple times, each with a subset of your data. In this case you wouldn't need to check which jobs didn't finish; each call would be submitted and return within the time limit

  3. Use the max_calls_worker argument and submit more jobs. This limits each worker to at most n calls, so each job stays under the time limit
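For option 3, a sketch using the example from the original post (the numbers here are illustrative): with calls of roughly 10 seconds and a 1-minute limit, capping each worker at 5 calls keeps every job under the limit, provided enough jobs are submitted to cover all 20 calls.

```r
# 4 workers x 5 calls x ~10 s per call = ~50 s per job, under the 60 s limit
result.list <- clustermq::Q(myfun, x=1:20,
                            n_jobs=4,             # more jobs, each doing less work
                            max_calls_worker=5,   # at most 5 calls per worker
                            template=list(minutes=1, megabytes=3000),
                            timeout=60)
```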

We could also consider making fail_on_error = FALSE return NULL for timed-out jobs.

mschubert avatar Jun 04 '25 08:06 mschubert

"fail_on_error = FALSE return NULL for timed-out jobs." -> this would work.

tdhock avatar Jun 04 '25 21:06 tdhock