Make batch.id robust to warning messages from sbatch
I ran into a crazy bug today: getJobStatus gave me batch.id = "that". It turns out that when I requested a large amount of memory, sbatch returned this um, helpful message:
sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory
because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
Submitted batch job 38139957
clusterFunctionsSlurm was pulling the 4th word of the first line, which should have been the Slurm jobid, but instead was "that". It wanted, of course, the last line.
This really isn't a bug in batchtools, as the sysops inserted an informational message in a crazy place. But I suspect if the smart, on the ball people at the UMass Unity cluster are doing this, others probably are too. It'd be nice for batchtools to be robust to such shenanigans. Alternatively, I suppose it could throw an error if batch.id is non-numeric and print the message from sbatch.
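A minimal sketch of that alternative, assuming a hypothetical helper check_batch_id() (this name and its arguments are illustrative and not part of batchtools). It rejects a job id that is not purely numeric and includes the raw sbatch output in the error:

```r
# Hypothetical helper, not part of batchtools: reject a batch.id that is not
# purely numeric and surface the raw sbatch output in the error message.
check_batch_id <- function(batch.id, output) {
  if (length(batch.id) != 1L || is.na(batch.id) || !grepl("^[0-9]+$", batch.id)) {
    stop("Could not parse a numeric Slurm job id; sbatch returned:\n",
         paste(output, collapse = "\n"), call. = FALSE)
  }
  batch.id
}
```

With the output shown above, check_batch_id("that", output) would stop with the sbatch message attached instead of silently storing "that" as the job id.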
My suggested change looks for a line beginning with "Submitted batch job" and pulls the 4th word as the batch.id.
I've tested this change against the following:
output <- 'Submitted batch job 12345678'
output <- 'This is a crazy informational message\nSubmitted batch job 98765432'
output <- 'This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff'
as well as against real-life submitJobs calls, both with and without the informational message.
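For illustration, the approach can be sketched roughly as follows. This is not the actual diff. parse_batch_id() is a hypothetical name, and the sketch assumes the sbatch output arrives either as a single string with embedded newlines or as a character vector of lines:

```r
# Hypothetical helper illustrating the approach; not the actual patch.
parse_batch_id <- function(output) {
  # Normalise the output to one element per line.
  lines <- unlist(strsplit(output, "\n", fixed = TRUE))
  # Keep only lines that begin with the expected sbatch confirmation.
  hit <- grep("^Submitted batch job ", lines, value = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  # The job id is the 4th whitespace-separated word of the matching line.
  strsplit(hit[1L], "[[:space:]]+")[[1L]][4L]
}

parse_batch_id('Submitted batch job 12345678')
parse_batch_id('This is a crazy informational message\nSubmitted batch job 98765432')
parse_batch_id('This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff')
```

Returning NA when no line matches leaves it to the caller to decide whether to raise an error.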
You might want to create an issue for this that references this pull request. I, at least, tend to miss or forget about PR-only issues over time, and I know other repos prefer an issue with the details, where discussion can take place.
Now, I had a look at runOSCommand(), which is what captures the output per
https://github.com/mlr-org/batchtools/blob/7763ed830548e590a2396b76e6c14a6d4c583620/R/runOSCommand.R#L44
That captures both stdout and stderr. It could be that it would be more sane if those two are captured separately, e.g. something like stdout = TRUE and stderr = "error.log", where the expected output should go to stdout and info messages to stderr. To test if that would have helped you, if you do
$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
what does
$ cat stdout.log
$ cat stderr.log
output? With Slurm, you should see "Submitted batch job ..." in stdout.log. Now, my hope is that "sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit." ends up in stderr.log for you.
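The same separation can be sketched from R with base system2(), which accepts stdout = TRUE together with a file name for stderr. This is only an illustration of the idea, not the runOSCommand() implementation. fake_sbatch is a made-up stand-in for sbatch so the snippet runs without a Slurm cluster, and it assumes a POSIX sh is available:

```r
# Stand-in for sbatch (an assumption for this sketch): writes an informational
# note to stderr and the submission line to stdout.
fake_sbatch <- "echo 'sbatch: INFO: a note' >&2; echo 'Submitted batch job 12345678'"

# Capture stdout as a character vector and send stderr to a separate file.
stderr_file <- tempfile(fileext = ".log")
out <- system2("sh", args = c("-c", shQuote(fake_sbatch)),
               stdout = TRUE, stderr = stderr_file)
err <- readLines(stderr_file)

print(out)  # only the "Submitted batch job ..." line
print(err)  # only the informational note
```

If sbatch on a given cluster writes its notes to stderr, the captured stdout would then contain only the submission line, and the existing "4th word" parsing would work unchanged.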
Nice!
bcompton_umass_edu@login1:~$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
bcompton_umass_edu@login1:~$ cat stdout.log
Submitted batch job 42933105
bcompton_umass_edu@login1:~$ cat stderr.log
sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
bcompton_umass_edu@login1:~$
It looks like you can do a cleaner fix than what I came up with.
I've been prototyping a more flexible runOSCommand() in my future.batchtools package. It has new arguments stdout and stderr, with defaults stdout = TRUE and stderr = TRUE (backward compatible). The special value stderr = NA will capture stderr separately from stdout.
@bwcompton, although it's future.batchtools and not batchtools, could you please give it a spin? If it works, then I can propose this newer runOSCommand() version to batchtools, plus adjustments to makeClusterFunctionSlurm(), which I also patch in future.batchtools.
To try it out, install it as:
remotes::install_github("futureverse/future.batchtools", ref="develop")
and then try it as:
library(future)
plan(future.batchtools::batchtools_slurm)
f <- future({ Sys.info()[["nodename"]] })
v <- value(f)
print(v)
See https://future.batchtools.futureverse.org/reference/batchtools_slurm.html for how to control sbatch resource specifications.
Thanks! I tried your code snippet, and it can't find slurm_script. Am I missing something?
Brad
> library(future)
> plan(future.batchtools::batchtools_slurm)
> f <- future({ Sys.info()[["nodename"]] })
> v <- value(f)
Error: Future () of class BatchtoolsSlurmFuture expired, which indicates that it crashed or was killed.
Post-mortem details:
Future state: ‘running’
Batchtools status: ‘defined’, ‘expired’, ‘submitted’
Slurm job ID: [n=1] ‘43049392’
Slurm 'squeue' job status:
Slurm 'sacct' job status: 43049392|FAILED|1:0
The last few lines of the logged output:
Session information:
- timestamp: 2025-09-12 14:36:54+0000
- hostname: cpu016
- Rscript path: /var/spool/slurm/slurmd/job43049392/slurm_script: line 20: Rscript: command not found
- Rscript version: /var/spool/slurm/slurmd/job43049392/slurm_script: line 21: Rscript: command not found
- Rscript library paths: Rscript -e 'batchtools::doJobCollection()' ...
- job name: 'jobb9686511f15322fe9d3568b52c61e703'
- job log file: '/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380/logs/jobb9686511f15322fe9d3568b52c61e703.log'
- job uri: '/work/pi_cschwe
In addition: Warning messages:
1: batchtools::waitForJobs(..., timeout = 2592000) returned FALSE
2: In delete.BatchtoolsFuture(future) : Will not remove batchtools registry, because the status of the batchtools was ‘error’, ‘defined’, ‘expired’, ‘submitted’ and future backend argument 'delete' is ‘on-success’: ‘/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380’
>
Rscript: command not found
R is not available by default in your jobs. Do you load an environment module to get access to R? If so, specify that in the resources argument, e.g.
plan(future.batchtools::batchtools_slurm, resources = list(modules = "r"))
This is also illustrated at https://future.batchtools.futureverse.org/reference/batchtools_slurm.html
If you use other techniques to make R available in a job script, please let me know.
That said, the job submission itself actually worked! It's just that R didn't start, which means the patch works.
Great news that the patch works.
Here's what I've got in my template, slurm.tmpl. I'm not sure how to squeeze this into the resources option--this is something I got help with from a sysadmin. It works great with batchtools.
## Call batchtools inside container
module load apptainer/latest
export APPTAINER_BINDPATH="/run/munge,/var/run/munge,/etc/slurm,/var/spool/slurm/slurmd/conf-cache/slurm.conf,$APPTAINER_BINDPATH"
apptainer exec /modules/admin-resources/ood-dev/unity-r_4.4.0.sif Rscript --no-restore --quiet --no-save -e 'batchtools::doJobCollection("<%= uri %>")'
I'm not sure how to squeeze this into the resources option
Unfortunately not possible today; you'd have to create your own custom template file. But, I've created https://github.com/futureverse/future.batchtools/issues/99 to add support for this too. Stay tuned.
Okay, I'll look forward to future.batchtools in the future.
Do you have what you need from me to address the original issue in this PR?
Do you have what you need from me to address the original issue in this PR?
Yes, I'd like to have a success story over at future.batchtools first, ideally some mileage from other users, and have my patch "ripe" enough, before I "bug" the batchtools maintainers here. So, I'll ping you again over at https://github.com/futureverse/future.batchtools/issues/99 for you to test. Thanks.
Deal! Thanks so much for your help with this.