ufs-weather-model
Incorrect job status returned from squeue in rt.sh
I am running the regression tests on Gaea and noticed that some test jobs fail with a wall-clock timeout error, but the rt scripts misinterpret the status of those jobs. This is the relevant part of the log file:
++ squeue -u Dusan.Jovic -j 135087666
+ job_info='JOBID CLUSTER PARTITION QOS USER STATE TIME_LEFT NODES NAME
135087666 c5 batch normal Dusan.Jovic COMPLETININVALID 3 run_cpld_debug_pdlib'
+ grep -q 135087666
+ job_running=true
++ grep 135087666
+ status='135087666 c5 batch normal Dusan.Jovic COMPLETININVALID 3 run_cpld_debug_pdlib'
++ awk '{print $5}'
+ status=Dusan.Jovic
+ case ${status} in
+ status_label=Unknown
+ echo 'rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic'
rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic
+ echo 'rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed'
rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed
+ echo '34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)'
34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)
In this specific case the job status is 'COMPLETININVALID', but the script reads the status as 'Dusan.Jovic', which is the user name rather than the job state. This happens because the default output format of the squeue command is not the same across systems running the Slurm scheduler:
On Hera:
$ squeue -u Dusan.Jovic -j 63947991
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
63947991 hera compile_atm_dyn32_intel Dusan.Jovic RUNNING 0:36 30:00 1 h20c01
On Gaea:
$ squeue -u Dusan.Jovic -j 135087764
JOBID CLUSTER PARTITION QOS USER STATE TIME_LEFT NODES NAME
135087764 c5 batch normal Dusan.Jovic RUNNING 17:23 3 run_cpld_debug_pdlib
On Hera the job state is in column 5, while on Gaea it is in column 6.
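As a small illustration of the mismatch, the sketch below applies the same `awk '{print $5}'` extraction seen in the log to the two sample lines above. The variable names are hypothetical and are not taken from rt_utils.sh.

```shell
# Sample squeue output lines copied from the Hera and Gaea examples above.
hera_line='63947991 hera compile_atm_dyn32_intel Dusan.Jovic RUNNING 0:36 30:00 1 h20c01'
gaea_line='135087764 c5 batch normal Dusan.Jovic RUNNING 17:23 3 run_cpld_debug_pdlib'

# Extract the fifth whitespace-separated field, as the log shows the script doing.
hera_field=$(echo "${hera_line}" | awk '{print $5}')
gaea_field=$(echo "${gaea_line}" | awk '{print $5}')

echo "Hera: ${hera_field}"   # prints "Hera: RUNNING"
echo "Gaea: ${gaea_field}"   # prints "Gaea: Dusan.Jovic"
```

The same field index yields the job state on one system and the user name on the other.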
Instead of running squeue with the default format, we must explicitly set the output format so that we get the same output on all platforms. For example, I suggest:
squeue -u "${USER}" -j "${jobid}" -o '%i %T'
In this case the first column will always be JOBID and the second column will always be STATE.
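A minimal sketch of how the status could then be extracted, assuming the two-column `%i %T` output suggested above. The helper name `parse_job_state` is hypothetical and is not an existing function in rt_utils.sh; it reads squeue output on standard input so it can be tried without a scheduler.

```shell
# Hypothetical helper (not part of rt_utils.sh): print the STATE column
# for the given job ID from two-column "JOBID STATE" input on stdin.
parse_job_state() {
  # $1 is the job ID to look up; the header line never matches it.
  awk -v id="$1" '$1 == id { print $2 }'
}

# Try it on canned input that mimics `squeue -o '%i %T'` output.
sample_output='JOBID STATE
135087666 COMPLETING'
status=$(printf '%s\n' "${sample_output}" | parse_job_state 135087666)
echo "Status: ${status}"
```

On a real system the input would come from the command itself, for example `squeue -u "${USER}" -j "${jobid}" -o '%i %T' | parse_job_state "${jobid}"`. Matching on the job ID in the first column, rather than using grep, avoids picking up the ID if it happens to appear in another field.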