batchtools
Implement `resubmitJobs()`
This will restart all jobs using the same resources (to fix #164), defaulting to expired jobs (as requested via mail). It can also be used in `bt[lm]apply()` to "resume" a calculation.
I have some of my own code for doing this, offered as a rough guide. It resubmits errored and expired jobs repeatedly until any job reaches `maxRetries` retries.
# Assumes `res` (a list with numeric memory, walltime and cores), `maxRetries`
# and a default registry `reg` are already defined.
message("Submitting unsubmitted jobs...")
batchtools::submitJobs(batchtools::findNotSubmitted()$job.id,
                       resources = res)
message("Waiting for jobs to complete...")
message(Sys.time())
# Retry counter per job, keyed by job id so non-contiguous ids also work.
ids = batchtools::findJobs()$job.id
job_retries = setNames(rep(0, length(ids)), ids)
while (length(batchtools::findNotDone()$job.id) > 0) {
  batchtools::waitForJobs(timeout = 60)
  err = batchtools::findErrors()$job.id
  expired = batchtools::findExpired()$job.id
  # Errored jobs are resubmitted with unchanged resources.
  if (length(err) > 0) {
    for (i in err) {
      key = as.character(i)
      job_retries[[key]] = job_retries[[key]] + 1
      message(paste0("Found error in job ", i,
                     ", restarting, retry attempt ", job_retries[[key]]))
      print(batchtools::getErrorMessages(i))
      batchtools::submitJobs(i, resources = res)
    }
  }
  # Expired jobs are resubmitted with resources scaled by 1.25^retries.
  if (length(expired) > 0) {
    for (i in expired) {
      key = as.character(i)
      job_retries[[key]] = job_retries[[key]] + 1
      scale = 1.25^job_retries[[key]]
      message(paste0("Found expired job ", i,
                     ", restarting with ", scale,
                     "x more resources, retry attempt ", job_retries[[key]]))
      res_job = res
      res_job$memory = round(res$memory * scale)
      res_job$walltime = round(res$walltime * scale)
      res_job$cores = round(res$cores * scale)
      #print(batchtools::getLog(i))
      batchtools::submitJobs(i, resources = res_job)
    }
  }
  # Give up once any job has used up its retries.
  if (max(job_retries) >= maxRetries) {
    message("Maximum number of retries exceeded, stopping jobs...")
    batchtools::killJobs()
    reg <<- reg  # export the registry to the global environment for debugging
    stop("Automatic retry failed, registry available for debugging at `reg`.")
  }
}
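The resource scaling in the expired-job branch can be pulled out into a small helper, which makes it easier to check on its own. This is only a sketch: `scale_resources()`, its `factor` and `fields` arguments, and the example values are hypothetical names chosen for illustration and are not part of batchtools.

```r
# Hypothetical helper, not part of batchtools: scale selected numeric
# resources by factor^retries, mirroring the expired-job branch above.
scale_resources = function(res, retries, factor = 1.25,
                           fields = c("memory", "walltime", "cores")) {
  scaled = res
  # Only touch fields that are actually present in the resource list.
  for (f in intersect(fields, names(res))) {
    scaled[[f]] = round(res[[f]] * factor^retries)
  }
  scaled
}

# Example values are assumptions; units depend on the cluster template.
res = list(memory = 1024, walltime = 3600, cores = 1)
res_retry2 = scale_resources(res, retries = 2)
str(res_retry2)
```

Note that with `cores = 1` the first retry still requests one core, because `round(1 * 1.25)` is 1; using `ceiling()` for `cores` would be one way to change that.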
Are there any plans to implement this? One of snakemake's best features is that it can resubmit jobs with increased user-defined resources (e.g., `mem = attempt ** 3 + 10`, with `attempt` incrementing by 1 for each job attempt).
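For illustration, the same attempt-based idea could be written in R as a function of the attempt number. Only the formula `attempt ** 3 + 10` comes from the comment above; `mem_for_attempt()`, `resources_for_attempt()` and the `base` list are hypothetical names, and how the resulting `memory` value is interpreted would depend on the cluster template in use.

```r
# Hypothetical names; only the formula attempt^3 + 10 is taken from the
# snakemake example quoted above.
mem_for_attempt = function(attempt) attempt^3 + 10

# Build a resource list for a given attempt from a base list of
# resources that do not change between attempts (assumed values).
resources_for_attempt = function(attempt, base = list(walltime = 3600)) {
  base$memory = mem_for_attempt(attempt)
  base
}

# Memory requested on attempts 1 to 3.
mem_by_attempt = sapply(1:3, mem_for_attempt)
print(mem_by_attempt)
```

A list built this way could then be passed as the `resources` argument of `batchtools::submitJobs()` when resubmitting a failed job.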
Using batchtools or clustermq, which don't have such a resubmit feature (AFAIK), can result in a lot of hassle when X% of hundreds or thousands of jobs are unsuccessful. I have to figure out which jobs failed, resubmit just those jobs with more resources, see which of those failed, resubmit them with more resources, and so on.
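The manual cycle described above (find the failed jobs, resubmit only those with more resources, repeat) can be sketched as a small scheduler-agnostic loop. Everything here is hypothetical: `retry_failed()` and its `submit` callback are illustrative names, and `fake_submit()` is a toy stand-in for a real scheduler call, not something batchtools or clustermq provides.

```r
# Hypothetical helper: `submit` is any function(ids, attempt) that runs the
# given jobs and returns the ids that failed on that attempt.
retry_failed = function(ids, submit, max_attempts = 3) {
  attempt = 1
  failed = ids
  while (length(failed) > 0 && attempt <= max_attempts) {
    failed = submit(failed, attempt)  # resubmit only the failed jobs
    attempt = attempt + 1
  }
  failed  # ids still failing after the last attempt
}

# Toy stand-in for a scheduler: each job needs a fixed amount of memory
# (assumed values), and the granted memory doubles with every attempt.
needs = c("1" = 2, "2" = 6, "3" = 12, "4" = 40)
fake_submit = function(ids, attempt) {
  granted = 4 * 2^(attempt - 1)
  ids[needs[as.character(ids)] > granted]
}

still_failed = retry_failed(1:4, fake_submit, max_attempts = 3)
print(still_failed)
```

In this toy run, jobs 1 to 3 succeed within three attempts, and job 4 is returned as still failing so it can be inspected by hand.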
For those that need/want resubmission of jobs and stumble upon this issue: I highly recommend snakemake (which can run R code), but it is often overkill for simpler tasks.