
Implement `resubmitJobs()`

Open mllg opened this issue 7 years ago • 2 comments

This will restart all jobs using the same resources (to fix #164) and will default to expired jobs (as requested via mail). It can also be used in bt[lm]apply() to "resume" a calculation.
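A minimal sketch of what the proposed interface might look like (this is not an existing batchtools function; a real implementation would look up the resources each job was last submitted with, e.g. via getJobResources(), instead of taking them as an argument):

library(batchtools)

# hypothetical helper: resubmit the given jobs, defaulting to the expired ones
resubmitJobs = function(ids = findExpired(reg = reg), resources = list(),
                        reg = getDefaultRegistry()) {
  submitJobs(ids, resources = resources, reg = reg)
}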

mllg avatar Feb 09 '18 18:02 mllg

I have some of my own code for doing this, as a rough guide. It resubmits errored or expired jobs multiple times until a job hits the maximum number of retries (`maxRetries`), scaling up the resources of expired jobs on each attempt.

message("Submitting unsubmitted jobs...")
batchtools::submitJobs(batchtools::findNotSubmitted()$job.id,
                       resources=res)


message("Waiting for jobs to complete...")
message(Sys.time())
job_retries = sapply(batchtools::findJobs()$job.id, function(x) {0})
while(length(batchtools::findNotDone()$job.id)>0){
	batchtools::waitForJobs(timeout=60)
	err = batchtools::findErrors()$job.id
	exp = batchtools::findExpired()$job.id
	if (length(err)>0){
		for(i in err){
			job_retries[i] = job_retries[i] + 1
			message(paste0("Found error in job ",i,
			   ", restarting, retry attempt ",job_retries[i]))
			print(batchtools::getErrorMessages(i))
			batchtools::submitJobs(i,resources=res)
		}
	}
	if (length(exp)>0){
		for(i in exp){
			job_retries[i] = job_retries[i] + 1
			message(paste0("Found expired job ",i,
			   ", restarting with ",1.25**job_retries[i],
			   "x more resources, retry attempt ",job_retries[i]))
			res_job = res
			res_job$memory = round(res$memory*(1.25**job_retries[i]))
			res_job$walltime = round(res$walltime*(1.25**job_retries[i]))
			res_job$cores = round(res$cores*(1.25**job_retries[i]))
			#print(batchtools::getLog(i))
			batchtools::submitJobs(i,resources=res_job)
		}
	}
	if(max(job_retries)>=maxRetries){
		message("Maximum number of retries exceeded, stopping jobs...")
		batchtools::killJobs()
		reg <<- reg
		stop("Automatic retry failed, registry available for debugging at `reg`.")
	}
}
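For context, the loop assumes a registry has already been created and populated (so the find*() helpers resolve against the default registry) and that `res` and `maxRetries` exist. An illustrative setup, with resource names depending on your cluster template:

library(batchtools)

reg = makeRegistry(file.dir = "retry_registry")   # becomes the default registry
batchMap(my_function, x = inputs)                 # my_function / inputs are placeholders
res = list(memory = 4096, walltime = 3600, cores = 1)
maxRetries = 3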

ryananeff avatar Aug 08 '19 18:08 ryananeff

Are there any plans to implement this? One of snakemake's best features is that it can resubmit jobs with increased, user-defined resources (e.g. mem = attempt ** 3 + 10, with attempt incremented by 1 on each attempt of the job).

Using batchtools or clustermq, which don't have such a resubmit feature (AFAIK), can result in a lot of hassle when X% of hundreds or thousands of jobs are unsuccessful. I have to figure out which jobs failed, resubmit just those jobs with more resources, see which of those jobs failed, resubmit the failed jobs with more resources, etc.
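A rough way to emulate snakemake's attempt-based scaling with plain batchtools is to track attempts yourself and resubmit expired jobs with recomputed resources. A minimal sketch, assuming a registry with submitted jobs is already set up and that memory/walltime match your cluster template:

library(batchtools)

attempt = rep(1, nrow(findJobs()))   # per-job attempt counter

while (nrow(findOnSystem()) > 0 || nrow(findExpired()) > 0) {
  waitForJobs(timeout = 60)
  for (i in findExpired()$job.id) {
    attempt[i] = attempt[i] + 1
    # snakemake-style formula: memory (in MB) grows with the attempt number
    submitJobs(i, resources = list(memory   = (attempt[i]^3 + 10) * 1024,
                                   walltime = 3600 * attempt[i]))
  }
}

(Errored jobs would still need separate handling, as in the loop above.)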

For those who need/want resubmission of jobs and stumble upon this issue: I highly recommend snakemake (which can run R code), but it is often overkill for simpler tasks.

nick-youngblut avatar Feb 08 '21 19:02 nick-youngblut