Make batch.id robust to warning messages from sbatch

Open bwcompton opened this issue 6 months ago • 11 comments

I ran into a crazy bug today: getJobStatus gave me batch.id = "that". It turns out that when I requested a large amount of memory, sbatch returned this, um, helpful message:

sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory 
because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
Submitted batch job 38139957

clusterFunctionsSlurm was pulling the 4th word of the first line, which should have been the Slurm jobid, but instead was "that". It wanted, of course, the last line.

This really isn't a bug in batchtools, as the sysops inserted an informational message in a crazy place. But I suspect if the smart, on the ball people at the UMass Unity cluster are doing this, others probably are too. It'd be nice for batchtools to be robust to such shenanigans. Alternatively, I suppose it could throw an error if batch.id is non-numeric and print the message from sbatch.

My suggested change looks for a line beginning with "Submitted batch job" and pulls the 4th word as the batch.id.

I've tested this change against the following:

output <- 'Submitted batch job 12345678'
output <- 'This is a crazy informational message\nSubmitted batch job 98765432'
output <- 'This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff'

as well as against real-life submitJobs calls, both with and without the informational message.
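
A minimal sketch of the parsing described above, not the actual patch in this pull request. It assumes the sbatch output arrives either as one string with embedded newlines or as a character vector of lines. It keeps only the lines that start with "Submitted batch job" and takes the 4th word of the last such line. The function name parse_batch_id is hypothetical. Stopping with the full sbatch output when no such line is found is one possible way to handle the error case mentioned earlier.

```r
# Hypothetical helper, not part of batchtools: extract the Slurm job id
# from sbatch output that may contain extra informational lines.
parse_batch_id <- function(output) {
  # Accept either a single string with embedded newlines or a vector of lines
  lines <- unlist(strsplit(output, "\n", fixed = TRUE))
  hits <- grep("^Submitted batch job ", lines, value = TRUE)
  if (length(hits) == 0L) {
    # No submission line at all: surface what sbatch actually printed
    stop("Could not find a job id in sbatch output:\n",
         paste(lines, collapse = "\n"))
  }
  # 4th whitespace-separated word of the last matching line
  words <- strsplit(trimws(hits[length(hits)]), "[[:space:]]+")[[1L]]
  words[4L]
}

parse_batch_id("Submitted batch job 12345678")
# "12345678"
parse_batch_id("This is a crazy informational message\nSubmitted batch job 98765432")
# "98765432"
parse_batch_id("This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff")
# "5555555"
```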

bwcompton avatar Jun 13 '25 02:06 bwcompton

You might want to create an issue for this that references this pull request. I, at least, tend to miss or forget about PR-only issues over time, and I know other repos prefer an issue with details where discussions can take place.

Now, I had a look at runOSCommand(), which is what captures the output per

https://github.com/mlr-org/batchtools/blob/7763ed830548e590a2396b76e6c14a6d4c583620/R/runOSCommand.R#L44

That captures both stdout and stderr. It might be cleaner to capture the two separately, e.g. with something like stdout = TRUE and stderr = "error.log", so that the expected output goes to stdout and info messages go to stderr. To test whether that would have helped you, if you run

$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log

what does

$ cat stdout.log
$ cat stderr.log

output? With Slurm, you should see "Submitted batch job ..." in stdout.log. Now, my hope is that "sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit." ends up in stderr.log for you.
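
To illustrate the idea from R, here is a small sketch using base R's system2(), which can capture stdout as a character vector while sending stderr to a file. It does not call sbatch. A stand-in shell command that writes one line to each stream plays its role, so the sketch assumes a Unix-like system with sh available. This is not how runOSCommand() is implemented, only an illustration of keeping the two streams apart.

```r
# Stand-in for sbatch (an assumption for illustration): one informational
# line on stderr, the job submission line on stdout
script <- "echo sbatch: INFO: informational note >&2; echo Submitted batch job 12345"
err_file <- tempfile(fileext = ".log")

# stdout = TRUE captures stdout in a character vector;
# a file name for stderr redirects stderr to that file
out <- system2("sh", args = c("-c", shQuote(script)),
               stdout = TRUE, stderr = err_file)
err <- readLines(err_file)

print(out)  # stdout only: the "Submitted batch job" line
print(err)  # stderr only: the informational note
```

With the streams separated like this, taking the 4th word of the first stdout line would be safe again, provided the site sends its informational messages to stderr.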

HenrikBengtsson avatar Sep 05 '25 21:09 HenrikBengtsson

Nice!

bcompton_umass_edu@login1:~$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
bcompton_umass_edu@login1:~$ cat stdout.log
Submitted batch job 42933105
bcompton_umass_edu@login1:~$ cat stderr.log
sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
bcompton_umass_edu@login1:~$

It looks like you can do a cleaner fix than what I came up with.

bwcompton avatar Sep 10 '25 20:09 bwcompton

I've been prototyping a more flexible runOSCommand() in my future.batchtools package. It has new arguments stdout and stderr, with defaults stdout = TRUE and stderr = TRUE (backward compatible). The special value stderr = NA will capture stderr separately from stdout.

@bwcompton, although it's future.batchtools and not batchtools, could you please give it a spin? If it works, then I can propose this newer runOSCommand() version to batchtools, plus adjustments to makeClusterFunctionSlurm(), which I also patch in future.batchtools.

To try it out, install it as:

remotes::install_github("futureverse/future.batchtools", ref="develop")

and then try it as:

library(future)
plan(future.batchtools::batchtools_slurm)
f <- future({  Sys.info()[["nodename"]] })
v <- value(f)
print(v)

See https://future.batchtools.futureverse.org/reference/batchtools_slurm.html for how to control sbatch resource specifications.

HenrikBengtsson avatar Sep 12 '25 04:09 HenrikBengtsson

Thanks! I tried your code snippet, and it can't find slurm_script. Am I missing something?

Brad

> library(future)
> plan(future.batchtools::batchtools_slurm)
> f <- future({ Sys.info()[["nodename"]] })
> v <- value(f)
Error: Future () of class BatchtoolsSlurmFuture expired, which indicates that it crashed or was killed.
Post-mortem details:
Future state: ‘running’
Batchtools status: ‘defined’, ‘expired’, ‘submitted’
Slurm job ID: [n=1] ‘43049392’
Slurm 'squeue' job status:
Slurm 'sacct' job status: 43049392|FAILED|1:0
The last few lines of the logged output:
Session information:

  • timestamp: 2025-09-12 14:36:54+0000
  • hostname: cpu016
  • Rscript path: /var/spool/slurm/slurmd/job43049392/slurm_script: line 20: Rscript: command not found
  • Rscript version: /var/spool/slurm/slurmd/job43049392/slurm_script: line 21: Rscript: command not found
  • Rscript library paths: Rscript -e 'batchtools::doJobCollection()' ...
  • job name: 'jobb9686511f15322fe9d3568b52c61e703'
  • job log file: '/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380/logs/jobb9686511f15322fe9d3568b52c61e703.log'
  • job uri: '/work/pi_cschwe
In addition: Warning messages:
1: batchtools::waitForJobs(..., timeout = 2592000) returned FALSE
2: In delete.BatchtoolsFuture(future) :
  Will not remove batchtools registry, because the status of the batchtools was ‘error’, ‘defined’, ‘expired’, ‘submitted’ and future backend argument 'delete' is ‘on-success’: ‘/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380’
>

bwcompton avatar Sep 12 '25 14:09 bwcompton

> Rscript: command not found

R is not available by default in your jobs. Do you load an environment module to get access to R? If so, specify that in the resources argument, e.g.

plan(future.batchtools::batchtools_slurm, resources = list(modules = "r"))

This is illustrated also in https://future.batchtools.futureverse.org/reference/batchtools_slurm.html

If you use other techniques to make R available in a job script, please let me know.

HenrikBengtsson avatar Sep 12 '25 15:09 HenrikBengtsson

That said, the job submission itself actually worked! It's just that R didn't start, which means the patch works.

HenrikBengtsson avatar Sep 12 '25 15:09 HenrikBengtsson

Great news that the patch works.

Here's what I've got in my template, slurm.tmpl. I'm not sure how to squeeze this into the resources option--this is something I got help with from a sysadmin. It works great with batchtools.

## Call batchtools inside container
module load apptainer/latest
export APPTAINER_BINDPATH="/run/munge,/var/run/munge,/etc/slurm,/var/spool/slurm/slurmd/conf-cache/slurm.conf,$APPTAINER_BINDPATH"

apptainer exec /modules/admin-resources/ood-dev/unity-r_4.4.0.sif Rscript --no-restore --quiet --no-save -e 'batchtools::doJobCollection("<%= uri %>")'

bwcompton avatar Sep 12 '25 16:09 bwcompton

> I'm not sure how to squeeze this into the resources option

Unfortunately, that's not possible today; you'd have to create your own custom template file. But I've created https://github.com/futureverse/future.batchtools/issues/99 to add support for this too. Stay tuned.

HenrikBengtsson avatar Sep 12 '25 17:09 HenrikBengtsson

Okay, I'll look forward to future.batchtools in the future.

Do you have what you need from me to address the original issue in this PR?

bwcompton avatar Sep 12 '25 17:09 bwcompton

> Do you have what you need from me to address the original issue in this PR?

Yes, I'd like to have a success story over at future.batchtools first, ideally some mileage from other users, and have my patch "ripe" enough, before I "bug" the batchtools maintainers here. So, I'll ping you again over at https://github.com/futureverse/future.batchtools/issues/99 for you to test. Thanks.

HenrikBengtsson avatar Sep 12 '25 17:09 HenrikBengtsson

Deal! Thanks so much for your help with this.

bwcompton avatar Sep 12 '25 17:09 bwcompton