
Incorrect job status returned from squeue in rt.sh

Open DusanJovic-NOAA opened this issue 6 months ago • 16 comments

I am running a regression test on Gaea and noticed that some test jobs fail with a wall-clock timeout error, but the rt scripts misinterpret the status of those jobs. This is part of the log file:

++ squeue -u Dusan.Jovic -j 135087666
+ job_info='JOBID       CLUSTER  PARTITION  QOS         USER                STATE    TIME_LEFT   NODES  NAME
135087666   c5       batch      normal      Dusan.Jovic         COMPLETININVALID     3      run_cpld_debug_pdlib'
+ grep -q 135087666
+ job_running=true
++ grep 135087666
+ status='135087666   c5       batch      normal      Dusan.Jovic         COMPLETININVALID     3      run_cpld_debug_pdlib'
++ awk '{print $5}'
+ status=Dusan.Jovic
+ case ${status} in
+ status_label=Unknown
+ echo 'rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic'
rt_utils.sh: *** WARNING ***: Job status unsupported: Dusan.Jovic
+ echo 'rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed'
rt_utils.sh: *** WARNING ***: Status might be non-terminating, please manually stop if needed
+ echo '34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)'
34 min. Slurm Job 135087666 Status: Unknown (Dusan.Jovic)

In this specific case the job status is 'COMPLETININVALID', but the script gets the status 'Dusan.Jovic', which is obviously wrong. This happens because the default output format of the squeue command is not the same across different systems running the Slurm scheduler:

On Hera:

$ squeue -u Dusan.Jovic -j 63947991
    JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)
 63947991 hera       compile_atm_dyn32_intel  Dusan.Jovic      RUNNING      0:36      30:00     1 h20c01

On Gaea:

$ squeue -u Dusan.Jovic -j 135087764
JOBID       CLUSTER  PARTITION  QOS         USER                STATE    TIME_LEFT   NODES  NAME
135087764   c5       batch      normal      Dusan.Jovic         RUNNING  17:23       3      run_cpld_debug_pdlib

On Hera the job state is in column 5, while on Gaea it is in column 6.
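
The mismatch can be reproduced without a scheduler by feeding the two sample rows above through the same `awk '{print $5}'` filter that appears in the log. This is only a minimal sketch; the variable names `hera_row`, `gaea_row`, `hera_status` and `gaea_status` are made up for illustration and are not taken from rt_utils.sh:

```shell
# Sample rows copied from the squeue output shown above.
hera_row=' 63947991 hera       compile_atm_dyn32_intel  Dusan.Jovic      RUNNING      0:36      30:00     1 h20c01'
gaea_row='135087764   c5       batch      normal      Dusan.Jovic         RUNNING  17:23       3      run_cpld_debug_pdlib'

# Same whitespace-delimited column extraction as in the trace above.
hera_status=$(printf '%s\n' "${hera_row}" | awk '{print $5}')
gaea_status=$(printf '%s\n' "${gaea_row}" | awk '{print $5}')

# On the Hera row column 5 is the job state; on the Gaea row it is the user name.
echo "Hera column 5: ${hera_status}"
echo "Gaea column 5: ${gaea_status}"
```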

Instead of just running squeue with the default format, we must explicitly set the desired format in order to get the same output on all platforms. For example, I suggest:

squeue -u "${USER}" -j "${jobid}" -o '%i %T'

In this case the first column will always be JOBID and the second column will always be STATE.
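
As a rough sketch of how the script could consume that fixed two-column output, the state can be read from the row whose first column equals the job id, which also skips the header line. The helper name `get_job_state` and the sample text in `sample_out` are hypothetical, used here only so the example runs without a scheduler; in practice the text would come from the squeue command suggested above:

```shell
# Hypothetical helper: print the STATE of one job from 'squeue -o "%i %T"' output.
# $1 = job id, $2 = text printed by squeue.
get_job_state() {
  # Select the row whose first column is the job id; the header row never matches.
  printf '%s\n' "$2" | awk -v id="$1" '$1 == id { print $2; exit }'
}

# Made-up stand-in for: squeue -u "${USER}" -j "${jobid}" -o '%i %T'
sample_out=$(printf 'JOBID STATE\n135087666 COMPLETING\n')

state=$(get_job_state 135087666 "${sample_out}")
missing=$(get_job_state 999 "${sample_out}")

echo "state=${state}"
echo "missing=${missing:-not found}"
```

Matching on the job id rather than on line position keeps the lookup independent of whether a header is printed; an empty result means the job is no longer listed.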

DusanJovic-NOAA, Jul 26 '24 22:07