easybuild-framework
Slurm JobID parsing failed.
$ eb NVHPC-22.7-CUDA-11.7.0.eb --job --job-backend=Slurm --job-max-walltime=4 --job-cores=4
== Temporary log file in case of crash /scratch_local/eb-4ztatilm/easybuild-7qbf3o78.log
== found valid index for /m100_work/FUSIO_ru6IPP_0/thayward/eb/software/EasyBuild/4.6.0/easybuild/easyconfigs, so using it...
ERROR: Failed to determine job ID from output of submission command: sbatch: no partition specified, using default partition m100_all_serial
Submitted batch job 7787202
I guess it tries to parse the line "sbatch: no partition specified, using default partition m100_all_serial" to get the job ID?
I guess the --parsable flag to sbatch might help?
Looks like the useful info is going to stdout in both cases, and the stderr message is causing the problem.
[thayward@login03 NVHPC]$ sbatch -n 1 --parsable tmp.sh
sbatch: no partition specified, using default partition m100_all_serial
7787232
[thayward@login03 NVHPC]$ sbatch -n 1 --parsable tmp.sh 2> /dev/null
7787233
[thayward@login03 NVHPC]$ sbatch -n 1 tmp.sh
sbatch: no partition specified, using default partition m100_all_serial
Submitted batch job 7787234
[thayward@login03 NVHPC]$ sbatch -n 1 tmp.sh 2> /dev/null
Submitted batch job 7787235
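If --parsable were used, stdout would carry just the job ID, optionally followed by a semicolon and the cluster name (as described in the sbatch man page). A minimal sketch of how that could be parsed, assuming stdout is captured separately from stderr; parse_parsable is a hypothetical helper, not something in EasyBuild, and the cluster name "m100" below is made up for illustration:

```python
def parse_parsable(stdout):
    """Return the job ID from `sbatch --parsable` stdout (hypothetical helper).

    Expects either "<jobid>" or "<jobid>;<cluster>" on stdout only;
    the stderr warning must not be mixed in.
    """
    jobid = stdout.strip().split(";", 1)[0]
    if not jobid.isdigit():
        raise ValueError("unexpected sbatch --parsable output: %r" % stdout)
    return jobid


print(parse_parsable("7787233\n"))
print(parse_parsable("7787233;m100\n"))  # "m100" is an assumed cluster name
```

This still relies on keeping stderr out of the parsed text, so on its own it would not fix the problem if both streams are merged.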
For the time being, I can work around this by setting the partition through an environment variable, so that sbatch no longer prints the warning:
(export SBATCH_PARTITION=m100_all_serial; sbatch -n 1 tmp.sh)
Submitted batch job 7787260
The relevant code is at https://github.com/easybuilders/easybuild-framework/blob/d7409b3d68516a0bf70688d14e5750dedfbb8758/easybuild/tools/job/slurm.py#L119:
jobid_regex = re.compile("^Submitted batch job (?P<jobid>[0-9]+)")
My instinct is to try removing the ^ from the regex.
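A quick way to check that idea, as a sketch only: it assumes the pattern is applied with search() to the combined stdout and stderr text shown in the error message above. find_jobid is a hypothetical helper written for this comparison. Keeping the ^ but adding re.MULTILINE is shown as a third option.

```python
import re

# Combined output as it appears in the error message: stderr warning first,
# then the line we actually want.
output = (
    "sbatch: no partition specified, using default partition m100_all_serial\n"
    "Submitted batch job 7787202\n"
)

# Current pattern: without re.MULTILINE, ^ only matches at the start of the text.
anchored = re.compile("^Submitted batch job (?P<jobid>[0-9]+)")
# Proposed fix: drop the anchor.
unanchored = re.compile("Submitted batch job (?P<jobid>[0-9]+)")
# Alternative: keep the anchor, but let ^ match at the start of every line.
multiline = re.compile("^Submitted batch job (?P<jobid>[0-9]+)", re.MULTILINE)


def find_jobid(regex, text):
    """Return the captured job ID as a string, or None if nothing matches."""
    match = regex.search(text)
    return match.group("jobid") if match else None


print(find_jobid(anchored, output))    # no match, the warning line comes first
print(find_jobid(unanchored, output))
print(find_jobid(multiline, output))
```

The re.MULTILINE variant is slightly stricter than dropping the ^, since it still requires the message to start a line.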
Sounds reasonable. If we can't parse only the stdout, I guess that would be the only easy fix?
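For comparison, here is roughly what parsing only stdout could look like with the standard subprocess module. This is a sketch under assumptions, not how EasyBuild runs commands: fake_sbatch is a stand-in that imitates sbatch by writing the warning to stderr and the job line to stdout, and submit_and_parse is a hypothetical helper.

```python
import re
import subprocess
import sys

jobid_regex = re.compile("^Submitted batch job (?P<jobid>[0-9]+)")

# Stand-in for sbatch (assumption for illustration): warning on stderr,
# submission message on stdout, as observed in the transcript above.
fake_sbatch = [
    sys.executable, "-c",
    "import sys; "
    "sys.stderr.write('sbatch: no partition specified, "
    "using default partition m100_all_serial\\n'); "
    "sys.stdout.write('Submitted batch job 7787202\\n')",
]


def submit_and_parse(cmd):
    """Run cmd, keep stdout and stderr apart, and parse the job ID from stdout only."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    match = jobid_regex.search(proc.stdout)
    if match is None:
        raise RuntimeError("no job ID in stdout; stderr was: %s" % proc.stderr)
    return match.group("jobid")


print(submit_and_parse(fake_sbatch))
```

With the streams kept apart, the original anchored regex works unchanged, since the warning never reaches the text being searched.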