batchtools
Error in readLog() on Slurm cluster
Problem
I'll cross-post a related feature request for future.batchtools, but I'm reliably getting:
Error : Log file ... for job with id 1 not available
when submitting jobs using future.batchtools on a small Slurm cluster.
Importantly, it only ever happens when a new machine has to be recruited. Because I allow certain partitions to scale down to 0 machines when not in use, I reliably get this error the first time I submit code to these partitions, either via future.batchtools or directly from batchtools. Jobs run as expected once a node is running & idle.
The log files do exist, and the jobs continue to run on the cluster after batchtools::readLog errors.
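For reference, a minimal sketch of how I trigger it (template path, partition name, and resource values below are placeholders for my setup):
library(future.batchtools)  # also attaches future
# Submitting to a partition that has scaled down to 0 machines reliably errors
# on the first future, even though the Slurm job itself runs to completion.
plan(batchtools_slurm,
     template  = "batchtools.slurm.tmpl",
     resources = list(partition = "scale-to-zero", walltime = 3600,
                      memory = 2048, ncpus = 1))
f <- future(Sys.info()[["nodename"]])
value(f)  # Error: Log file ... for job with id 1 not available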
Environment
Slurm cluster on Google Compute Engine, set up like this with a near-default config, but with preemptible partitions which can scale down to 0 machines.
SessionInfo
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /apps/R/4.1.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/4.1.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 googleCloudStorageR_0.6.0
[5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.3
[9] tidyverse_1.3.1 brms_2.15.0 Rcpp_1.0.6 dplyr_1.0.6
[13] future.batchtools_0.10.0 batchtools_0.9.15 stringi_1.6.2 future_1.21.0
[17] stantargets_0.0.2.9000 tarchetypes_0.2.0 targets_0.4.2 aws.s3_0.3.21
Template
#!/bin/bash
## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
## Set larger than 1 if you want to further parallelize
## with MPI within your job.
## ncpus [integer(1)]: Number of required cpus per task,
## Set larger than 1 if you want to further parallelize
## with multicore/parallel within each task.
## walltime [integer(1)]: Walltime for this job, in seconds.
## Must be at least 60 seconds for Slurm to work properly.
## memory [integer(1)]: Memory in megabytes for each cpu.
## Must be at least 100 (when I tried lower values my
## jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (!is.null(resources$partition)) sprintf("#SBATCH --partition='%s'", resources$partition) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
## Initialize work environment like
## source /etc/profile
## module add ...
## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>
<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
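For completeness, a sketch of a matching ~/.batchtools.conf.R (the template file name and resource values are just examples; the resource names follow the comments in the template above):
cluster.functions = makeClusterFunctionsSlurm(template = "batchtools.slurm.tmpl")
default.resources = list(walltime = 3600, memory = 2048, ncpus = 1,
                         omp.threads = 1, blas.threads = 1)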
Potential Fix
Jobs reliably take 60 seconds (almost always), and at most around 69 seconds, to begin after submission when a partition has to recruit a new machine. This means that waitForFile times out with the default fs.latency = 65.
I've played around with both scheduler.latency & fs.latency, and extending fs.latency even to 260 seconds doesn't solve the problem (why? Is waitForFile getting an unexpected value for path by trying to run fs::path_dir(fn) before the machine is fully set up?). My problem is solved by increasing scheduler.latency to 60-70 seconds, which allows me to drop fs.latency down to 10-20.
This solves my problem, but makes batchtools slow to recognise that the job has completed.
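Concretely, this is the combination that currently works for me, set via .batchtools.conf.R (values as described above; template name is a placeholder):
cluster.functions = makeClusterFunctionsSlurm(
  template          = "batchtools.slurm.tmpl",
  scheduler.latency = 60,  # sleep after submitJobs(), long enough to cover node provisioning
  fs.latency        = 20
)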
Feature Request
Divide scheduler.latency into two values - one for the initial sleep after submission, and one for subsequent interactions with the scheduler.
This problem has been solved by the pull request here, though it may reveal another bug in future.batchtools, as outlined here.
After this bugfix I can run batchtools jobs on a Slurm cluster as expected, even when the machine needs to be provisioned. However, despite adapting my Slurm cluster's NFS settings to sync as hinted at here, I still need to use an fs.latency of 70-80 seconds, though I can now use a scheduler.latency of 1.
@stuvet: Thanks for looking into this in more detail! I am currently experiencing what seems to be the same issue (Error: Log file '[..]' for job with id 1 not available) with LSF. Manually setting scheduler.latency = 60 in makeClusterFunctionsLSF also solved the issue in my case.
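For reference, this is roughly what I set in my .batchtools.conf.R (template name here is a placeholder):
cluster.functions = makeClusterFunctionsLSF(template = "lsf", scheduler.latency = 60)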
As some of the listed PRs are Slurm-specific (https://github.com/mllg/batchtools/pull/277/files) while others should also affect LSF (https://github.com/mllg/batchtools/pull/276/files), I am wondering whether you have a suggestion as to which changes would need to be applied to the LSF system as well.
Hi @kpj, I'm sorry, I have no idea about LSF, but I was mistaken in thinking that I needed fs.latency = 60 - this hid the problem for a while, but it wasn't the root cause.
The main problem for me was the mismatch between the full log file path passed to waitForFile and the basename recovered by list.files without the full.names argument. It would only hit that issue if the log file hadn't already been registered when readLog was first called, so I suspect there was a race condition that only manifested when I had to wait for new workers to be recruited. I also tried to change some of the network mount flags in my Slurm cluster to try to reduce filesystem latency. I'm not sure if this helped, but it may be worth checking this if you have access.
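To illustrate the path mismatch mentioned above, here is a plain base R sketch (not batchtools internals):
log.file <- tempfile(fileext = ".log")                   # a full path, like the one passed to waitForFile
file.create(log.file)
list.files(dirname(log.file))                            # basenames only, e.g. "file1a2b3c.log"
log.file %in% list.files(dirname(log.file))              # FALSE: full path vs basename
basename(log.file) %in% list.files(dirname(log.file))    # TRUE
list.files(dirname(log.file), full.names = TRUE)         # these would match the full path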
I also noticed that some of the Slurm status codes were missing from makeClusterFunctionsSlurm, though I think this resulted in a different error. Perhaps check you're on the most recent version of batchtools, and check that the makeClusterFunctionsLSF listJobs calls are likely to return all the jobs you might expect. On that note, do you see a delay between submitting the job and the job running on LSF? Does the job continue to run after batchtools has hit the error? If so, I'd be suspicious that the listJobs or listJobsQueued functions in makeClusterFunctionsLSF aren't finding all the jobs. Do you have any way to test this? If you find this is the case, you could get around it by passing a custom set of functions to makeClusterFunctions rather than relying on the default makeClusterFunctionsLSF.
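A rough sketch of that last idea, assuming you reuse the stock LSF functions as building blocks (the template name is a placeholder, and the custom listJobsQueued below just mirrors what the built-in one does, so you can inspect or adjust the bjobs query yourself):
library(batchtools)
lsf <- makeClusterFunctionsLSF(template = "lsf")
my_listJobsQueued <- function(reg) {
  # query the scheduler directly so you can see exactly what comes back
  res <- runOSCommand("bjobs", c("-u", Sys.getenv("USER"), "-w", "-p"))
  stringi::stri_extract_first_regex(tail(res$output, -1L), "\\d+")
}
cf <- makeClusterFunctions(
  name              = "CustomLSF",
  submitJob         = lsf$submitJob,
  killJob           = lsf$killJob,
  listJobsQueued    = my_listJobsQueued,
  listJobsRunning   = lsf$listJobsRunning,
  scheduler.latency = 60,
  fs.latency        = 65
)
# then point your .batchtools.conf.R at it:  cluster.functions = cf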
Sorry I can't be more helpful, but hopefully it gives you some options about where to look next.
Thanks a lot for your response!
I also tried to change some of the network mount flags in my Slurm cluster to try to reduce filesystem latency. I'm not sure if this helped, but It may be worth you checking this if you have access.
Unfortunately, I don't have access, and I am used to network-latency-related issues.
check you're on the most recent version of batchtools
I cloned mllg/batchtools and am on the latest commit (63a6f812927f6dc78595abbffac0f6bc9651f67a).
check that the makeClusterFunctionsLSF listJobs calls are likely to return all the jobs you might expect
When I print stri_extract_first_regex(tail(res$output, -1L), "\\d+") in makeClusterFunctionsLSF$listJobs (i.e. what the function would return), it does indeed look a bit weird. Sometimes there are NA entries. I'll have to look into this in more detail to understand why and if that's an issue. But this just executes LSF's bjobs, so not too much should be able to go wrong.
do you see a delay between submitting the job and the job running on LSF?
What kind of delay do you mean? makeClusterFunctionsLSF$submitJob seems to only call runOSCommand("bsub", stdin = outfile), so any delay (i.e. the job being PENDING before RUNNING) should be due to LSF scheduling. But yes, this kind of delay I do indeed observe.
Does the job continue to run after batchtools has hit the error?
Yes. E.g., right now I still have a few jobs running while my furrr::future_map_dfr call crashed over an hour ago.
Sorry I can't be more helpful, but hopefully it gives you some options about where to look next.
No worries, you already gave me some useful pointers :-)
What kind of delay do you mean?
In my case it was a provisioning delay: because the Slurm partition autoscaled down to 0 workers when not in use, there was a 60-90 second window while the worker was being recruited before a job could start. During that window Slurm would return a pending status code for the job that wasn't being handled properly by makeClusterFunctionsSlurm, & batchtools would exit with an error while the job continued to run on Slurm.
Sometimes there are NA entries.
Sounds like it may be worth chasing up those makeClusterFunctionsLSF$listJobs NA values to see if they'd be handled correctly by batchtools, or more specifically checking that all possible LSF job status codes are captured either via listJobsRunning or via listJobsQueued. One of my issues came about because future.batchtools (which operates via batchtools) wasn't handling missing status values gracefully: it assumed the process had exited, whereas batchtools called directly would have waited a bit longer before polling job status. Despite that, the best solution was to update the Slurm queries in makeClusterFunctionsSlurm to map all possible Slurm status codes to either the listJobsRunning or listJobsQueued functions.
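A hypothetical illustration of that mapping, using squeue directly rather than batchtools internals (the state lists here are my assumptions, not necessarily the exact ones used in the PR):
# list Slurm job ids that are currently in the given states
slurm_job_ids <- function(states) {
  out <- system2("squeue",
                 c("--noheader", "--format=%i",
                   paste0("--states=", states),
                   paste0("--user=", Sys.getenv("USER"))),
                 stdout = TRUE)
  out[nzchar(out)]
}
slurm_job_ids("R,CG")     # "running": running / completing
slurm_job_ids("PD,CF,S")  # "queued": pending / configuring (node being provisioned) / suspended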
Best of luck with it! This took me ages to debug, especially since I had to track these errors through 3 different R packages & troubleshoot an HPC system that I didn't understand at all!
In my case it was a provisioning delay,
I see, very interesting! I think this is not the case for me.
Sounds like it may be worth chasing up that makeClusterFunctionsLSF$listJobs NA values
When I explicitly removed NA entries from the list of returned job ids, I ended up running into the BatchtoolsExpiration error which you and a few others apparently also encountered before. I ended up creating a new issue for that on mllg/batchtools (https://github.com/mllg/batchtools/issues/285).
future.batchtools (which operates via batchtools) wasn't handling missing status values gracefully
As far as I can tell, makeClusterFunctionsLSF is not really using explicit status values (except when determining whether no jobs exist) and uses the -r and -p command-line parameters of bjobs instead. I'll have to check what the implications of this are.
Best of luck with it! This took me ages to debug, especially since I had to track these errors through 3 different R packages & troubleshoot an HPC system that I didn't understand at all!
Thanks a lot, this is very reassuring 😄
I ended up running into the BatchtoolsExpiration
Now you're definitely down the right track! (I hope...). I'd leave the NAs in and try to work out how they're being produced - do those -r & -p flags bring up all possible jobs, or do you have permission to call them as $USER?
As far as I can tell makeClusterFunctionsLSF is not really using explicit status values (except when determining whether no jobs exist) and uses the -r and -p commandline parameters of bjobs instead. I'll have to check what the implications of this are.
Options for bjobs - Looking at this, I'd take a look at the -q flag, as I'd bet that these fall through the -p & -r net, & it'd be a quick fix if so. It might also be worth checking the -pe, -pi & -pei flags to make sure these are all captured by -p. It also looks like suspended jobs would fall through the net, but this probably isn't common.
Sorry for the late response, our cluster was down...
I'd leave the NAs in, and try to work out how they're being produced
The NAs apparently come from the interspersed message New job is waiting for scheduling; when calling bjobs with -p:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
212458300 [..] PEND gpu.4h [..] - job615bf5b3fea0c6005c9bd24770e18aea Apr 7 15:46
New job is waiting for scheduling;
212458315 [..] PEND gpu.4h [..] - jobad05dfe510402c7a7c99375441024b54 Apr 7 15:47
New job is waiting for scheduling;
212458328 [..] PEND gpu.4h [..] - jobffeb247934663944e4144a59c46a333b Apr 7 15:47
New job is waiting for scheduling;
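The NAs are easy to reproduce from output like the above (job names shortened):
library(stringi)
output <- c("212458300 user PEND gpu.4h host - job615bf... Apr 7 15:46",
            "New job is waiting for scheduling;")
stri_extract_first_regex(output, "\\d+")
# [1] "212458300" NA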
do those -r & -p flags bring up all possible jobs, or do you have permission to call them as $USER?
The whole command being executed is actually bjobs -u $USER -w -p, so I only observe my own jobs.
I'd take a look at the -q flag as I'd bet that these fall through the -p & -r net
-q sets the queue for which to list jobs. How could that help in this case?
It might also be worth checking the -pe, -pi & -pei flags to make sure these are all captured by -p.
In my tests, -p did indeed show the union of -pe, -pi & -pei, so this seems fine as well.
It also looks like suspended jobs would fall through the net, but this probably isn't common.
That's definitely something to check later on, but for now none of my jobs get suspended while triggering the crash.
-q sets the queue for which to list jobs. How could that help in this case?
Ah, I misread the -q description - I'd understood it to list queued jobs.
The NAs apparently come from the interspersed message New job is waiting for scheduling; when calling bjobs with -p:
Interesting. I wonder if the batchtools registry sees 3 jobs in your example, or 6 (3 without the appropriate job ids & statuses etc)?
Interesting. I wonder if the batchtools registry sees 3 jobs in your example, or 6 (3 without the appropriate job ids & statuses etc)?
To get the list of running jobs, batchtools looks at the result of chsetdiff:
https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/clusterFunctions.R#L271-L276
chsetdiff is defined as follows (where chmatch apparently comes from data.table):
https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/helpers.R#L130-L133
Simulating a call to getBatchIds(reg, status = "queued") would result in the evaluation of something like chsetdiff(c("a", "b", NA, NA), c()), which indeed yields "a" "b" NA NA.
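Checking that locally with a quick re-implementation of the two-line helper linked above (chmatch from data.table, character(0) standing in for the empty id set):
library(data.table)
chsetdiff <- function(x, y) x[chmatch(x, y, nomatch = 0L) == 0L]
chsetdiff(c("a", "b", NA, NA), character(0))
# [1] "a" "b" NA  NA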
But in the end they remove NA entries so it seems all good: https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/clusterFunctions.R#L279
I guess I have to keep searching 😬
FWIW, I gave up on debugging this complex network of packages and ended up implementing my own LSF scheduling wrapper. It is not super sophisticated, but it works for me and allows writing reasonably concise code such as:
foo <- 42
cluster_apply(data.frame(i = 1:3, group = c('A', 'B', 'C')), function(i, group) {
  log_debug("hello")
  data.frame(group = group, i = i, foo = foo, result = foo + 2 * i)
}, .packages = c(logger))
FWIW, in the next release of future.batchtools, you can tweak the underlying cluster-function parameters, e.g.
plan(batchtools_slurm, scheduler.latency = 60, fs.latency = 20)
Until it is on CRAN, you can install the develop version using:
remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")
I'm not sure if this fixes any of the problems here, but thought I'd comment in case it does.