
Cylc 8.3.6 - submit timeout for a task that has already completed

Open jarich opened this issue 10 months ago • 15 comments

Description

We saw an odd situation where job submission hit what appears to be a race condition:

  1. dn_um_createbc 01 moves into a preparing state: 2025-04-12T10:30:12Z INFO - [20250412T0800Z/dn_backup_frame:waiting] => preparing
  2. PBS accepts 01 after 5 minutes
    04/12/2025 10:35:00  L    Considering job to run
    04/12/2025 10:35:00  S    Job Queued at request of user@host, owner = user@host, job name = 
    dn_um_createbc.20250412T0800Z.workflow, queue = prod-ppn
    04/12/2025 10:35:00  S    Job Modified at request of Scheduler@pbs-edc
    04/12/2025 10:35:00  S    Job Run at request of Scheduler@pbs-edc on exec_vnode (casey-ppn04:ncpus=1:mem=2097152kb)
    04/12/2025 10:35:00  L    Job run [...]
    
  3. Cylc sees and records a [jobs-submit err] killed on timeout (PT10M)
    2025-04-12T10:40:13Z ERROR - [jobs-submit cmd] cat /home/user/cylc-run/access_c4_dn/log/job/20250412T0800Z/dn_backup_frame/01/job | ssh -oBatchMode=yes -oConnectTimeout=10 prod-dm.ddns.australis.internal.bom.gov.au env CYLC_VERSION=8.3.6 CYLC_ENV_NAME=cylc-8.3.6 bash --login -c ''"'"'exec "$0" "$@"'"'"'' /g/sc/prod_deploy/prod/user/aps4_c/v1.6.5/aps4_c_dn/cylc-bin/cylc jobs-submit --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/access_c4_dn/log/job' 20250412T0800Z/dn_backup_frame/01
        [jobs-submit ret_code] -9
        [jobs-submit err] killed on timeout (PT10M)
    2025-04-12T10:40:13Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 jobhost env CYLC_VERSION=8.3.6 CYLC_ENV_NAME=cylc-8.3.6 bash --login -c ''"'"'exec "$0" "$@"'"'"'' /g/sc/prod_deploy/prod/user/aps4_c/v1.6.5/aps4_c_dn/cylc-bin/cylc jobs-submit --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/access_c4_dn/log/job' 20250412T0800Z/dn_backup_frame/01
        [jobs-submit ret_code] 1
        [jobs-submit out] 2025-04-12T10:40:13Z|20250412T0800Z/dn_backup_frame/01|1
    2025-04-12T10:40:13Z CRITICAL - [20250412T0800Z/dn_backup_frame/01:preparing] submission failed
    
  4. Cylc prepares to resubmit 01 (again)
    2025-04-12T10:40:13Z INFO - [20250412T0800Z/dn_backup_frame/01:preparing] => waiting
    2025-04-12T10:40:13Z WARNING - [20250412T0800Z/dn_backup_frame:waiting] retrying in PT1M (after 2025-04-12T10:41:13Z)
    2025-04-12T10:40:13Z ERROR - [jobs-submit cmd] cat /home/user/cylc-run/access_c4_dn/log/job/20250412T0800Z/dn_um_createbc/01/job | ssh -oBatchMode=yes -oConnectTimeout=10 jobhost env CYLC_VERSION=8.3.6 CYLC_ENV_NAME=cylc-8.3.6 bash --login -c ''"'"'exec "$0" "$@"'"'"'' /g/sc/prod_deploy/prod/user/aps4_c/v1.6.5/aps4_c_dn/cylc-bin/cylc jobs-submit --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/access_c4_dn/log/job' 20250412T0800Z/dn_um_createbc/01
        [jobs-submit ret_code] -9
        [jobs-submit out]
        [TASK JOB SUMMARY]2025-04-12T10:30:15Z|20250412T0800Z/dn_um_createbc/01|0|6552553
        [TASK JOB COMMAND]2025-04-12T10:30:15Z|20250412T0800Z/dn_um_createbc/01|[STDOUT] 6552553.pbs-edc
        [jobs-submit err] killed on timeout (PT10M)
    
  5. Cylc then starts preparing 02:
    2025-04-12T10:41:13Z INFO - [20250412T0800Z/dn_backup_frame:waiting] => waiting(queued)
    2025-04-12T10:41:13Z INFO - [20250412T0800Z/dn_backup_frame:waiting(queued)] => waiting
    2025-04-12T10:41:13Z INFO - [20250412T0800Z/dn_backup_frame:waiting] => preparing
    2025-04-12T10:41:15Z INFO - [20250412T0800Z/dn_backup_frame/02:preparing] submitted to prod-dm_aps4_c_b:pbs[6552935]
    2025-04-12T10:41:15Z INFO - [20250412T0800Z/dn_backup_frame/02:preparing] => submitted
    2025-04-12T10:41:18Z INFO - [20250412T0800Z/dn_backup_frame/02:submitted] => running
    
  6. PBS starts running this new submission; it takes about 10 seconds:
    04/12/2025 10:43:43  S    enqueuing into prod-dm, state 1 hop 1
    04/12/2025 10:43:43  S    Job Queued at request of user@host, owner = user@host, job name = dn_backup_frame.20250412T0800Z.access_c4_dn, queue = prod-dm
    04/12/2025 10:43:43  A    user=user group=domain users project=aps4_c jobname=dn_backup_frame.20250412T0800Z.access_c4_dn queue=prod-dm ctime=1744454623 qtime=1744454623 etime=1744454623 Resource_List.mem=1gb Resource_List.mpiprocs=1 Resource_List.ncpus=1 [...]
    04/12/2025 10:43:44  L    Considering job to run
    04/12/2025 10:43:44  S    Job Modified at request of Scheduler@pbs-edc
    04/12/2025 10:43:44  S    Job Run at request of Scheduler@pbs-edc on exec_vnode (casey-dm01:ncpus=1:mem=1048576kb)
    04/12/2025 10:43:44  L    Job run
    04/12/2025 10:43:44  A    user=user group=domain users project=aps4_c jobname=dn_backup_frame.20250412T0800Z.access_c4_dn queue=prod-dm ctime=1744454623 qtime=1744454623 etime=1744454623 start=1744454624 exec_host=casey-dm01/0 exec_vnode=(casey-dm01:ncpus=1:mem=1048576kb) [...]
    
  7. PBS tells Cylc that 01 succeeded
    04/12/2025 10:43:45  S    Obit received momhop:1 serverhop:1 state:4 substate:42
    04/12/2025 10:43:46  S    Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:02 resources_used.mem=53436kb resources_used.ncpus=1 resources_used.vmem=283892kb resources_used.walltime=00:08:43
    
  8. Cylc warns about receiving success for 01 but ignores it.
    2025-04-12T10:43:46Z WARNING - [20250412T0800Z/dn_backup_frame/02:running] (received-ignored)started for job(01) != job(02)
    2025-04-12T10:43:47Z WARNING - [20250412T0800Z/dn_backup_frame/02:running] (received-ignored)succeeded for job(01) != job(02)
    
  9. PBS records that the second job also succeeded and probably tries to tell Cylc (possibly ignored again? it's not in the logs)
   04/12/2025 10:43:51  S    Obit received momhop:1 serverhop:1 state:4 substate:42
   04/12/2025 10:43:51  S    Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:02 resources_used.mem=51448kb resources_used.ncpus=1 resources_used.vmem=105416kb resources_used.walltime=00:00:03
  10. Cylc attempts to poll 02, which doesn't exist, so the workflow waits indefinitely for this (completed) job to succeed
    2025-04-12T10:51:18Z INFO - [20250412T0800Z/dn_backup_frame/02:running] poll now, (next in PT5M (after 2025-04-12T10:56:18Z))

Looking at the PBS logs, the second dn_backup_frame task (which should have been 02) was submitted as 01:

Submit_arguments = /g/sc/home_prod/user/cylc-run/access_c4_dn/log
        /job/20250412T0800Z/dn_backup_frame/01/job

Cylc therefore thinks that 02 was submitted (and indeed a second job was submitted), but really that was 01 again. 02 was never submitted, never ran, and the 02 folder never existed. As this task was required to succeed, the workflow stalled.
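For context, the (received-ignored) warnings above come from the scheduler comparing the submit number carried in each job message against the latest submit number it knows for the task. A minimal sketch of that check (function and field names are hypothetical, not Cylc's actual internals):

```python
# Hypothetical sketch of the submit-number check behind the
# "(received-ignored)" warnings; names do not match Cylc internals.

def handle_job_message(task, msg_submit_num, message):
    """Accept a job message only if it comes from the latest submission."""
    if msg_submit_num != task["latest_submit_num"]:
        # An earlier (orphaned) submission is reporting in; ignore it.
        return (f"(received-ignored){message} for "
                f"job({msg_submit_num:02d}) != "
                f"job({task['latest_submit_num']:02d})")
    task["state"] = message
    return f"accepted: {message}"

task = {"latest_submit_num": 2, "state": "submitted"}
print(handle_job_message(task, 1, "succeeded"))  # ignored: stale job 01
print(handle_job_message(task, 2, "started"))    # accepted: current job 02
```

Under a check like this, if both real PBS jobs identify themselves as 01 while the scheduler is expecting 02, every message from the second job is also discarded and the task can never complete.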

I'm sorry that I have no idea how to create a reproducible example from this.

Expected Behaviour

If a submit failure occurs, new jobs are correctly submitted with the updated job ID.

The idea of ignoring job success after a new submission may need some extra consideration.

jarich avatar Apr 17 '25 09:04 jarich

@psa for reference

jarich avatar Apr 17 '25 09:04 jarich

One question that probably needs answering: Why is the jobs-submit command timing out after 10 mins (5 mins after PBS apparently submits the job)?

Therefore, Cylc thinks that 02 was submitted (and indeed a second job was submitted) but really that was 01 again. Therefore 02 was never submitted, never ran and the folder 02 never existed.

Sorry, I'm confused by this.

Can you check what the task_jobs table in the workflow DB says?
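For anyone else following along: the workflow database can be inspected directly with sqlite3 at ~/cylc-run/&lt;workflow&gt;/log/db. The sketch below runs the kind of query meant here against an in-memory mock table (the column set shown is an assumption about the task_jobs schema, so check the real one with .schema task_jobs first; the job ID is taken from the log above):

```python
import sqlite3

# Mock of the workflow DB's task_jobs table; the real DB lives at
# ~/cylc-run/<workflow>/log/db. The columns used here are an
# assumption -- run ".schema task_jobs" in sqlite3 to see the real schema.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE task_jobs (
        cycle TEXT, name TEXT, submit_num INTEGER,
        submit_status INTEGER, run_status INTEGER, job_id TEXT)
""")
# Submission 01 failed (no job ID recorded); 02 got the PBS ID from the log.
db.execute("INSERT INTO task_jobs VALUES "
           "('20250412T0800Z', 'dn_backup_frame', 1, 1, NULL, NULL)")
db.execute("INSERT INTO task_jobs VALUES "
           "('20250412T0800Z', 'dn_backup_frame', 2, 0, NULL, '6552935.pbs-edc')")

# What submissions does the scheduler believe it made for this task?
for row in db.execute(
        "SELECT submit_num, submit_status, job_id FROM task_jobs "
        "WHERE name = 'dn_backup_frame' ORDER BY submit_num"):
    print(row)
```

Comparing the job_id column against what PBS reports (qstat -f) for each submission is exactly the cross-check that exposed the 01/02 mismatch later in this thread.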

MetRonnie avatar Apr 17 '25 11:04 MetRonnie

PBS accepts 01 after 5 minutes

Any idea if the 5 minute delay is due to Cylc (maybe the process pool, which handles job submission and other processes, is rammed), or to PBS taking its sweet time?

[UPDATE] OK I think this is the Cylc process pool timeout:

Cylc sees and records a [jobs-submit err] killed on timeout (PT10M)

That means the job submission command has not returned after 10 minutes, even though PBS evidently accepted the job after 5 minutes. Which is strange.

Cylc prepares to resubmit 01 (again) INFO - [20250412T0800Z/dn_backup_frame/01:preparing] => waiting

The job number 01 on the left side refers to the task prior to the failure-induced state change (back to waiting on the retry). When the retry occurs the job number will be 02.

Cylc attempts to poll 02 which doesn't exist so the workflow waits indefinitely for this (completed) job to succeed

Polling doesn't (or shouldn't) work like that. If the job is no longer present in PBS, the polling process looks at the job status file (which is updated automatically by the job wrapper) to infer success or failure. There should be an additional logged message with the polled result for the task, something like (polled) succeeded. If not, maybe your system was so overloaded that even job polling was taking forever to return?
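The polling fallback described above can be sketched roughly as follows (the helper names and status-file format here are illustrative stand-ins, not Cylc's real API):

```python
# Rough sketch of the polling fallback described above; helper names
# and the status-file format are illustrative, not Cylc's real code.

def query_job_runner(job_id, queue):
    """Stand-in for a qstat lookup; None if PBS no longer knows the job."""
    return queue.get(job_id)

def poll_task(job_id, queue, status_file_lines):
    state = query_job_runner(job_id, queue)
    if state is not None:
        return state  # the job runner still knows the job
    # Job has left the queue: infer the outcome from the job status
    # file, which the job wrapper updates as the job runs.
    for line in reversed(status_file_lines):
        if line.startswith("CYLC_JOB_EXIT="):
            return "succeeded" if line.endswith("SUCCEEDED") else "failed"
    return "submit-failed"  # no trace the job ever started

# Job gone from PBS, but the status file records a clean exit:
print(poll_task("6552935", {}, ["CYLC_JOB_EXIT=SUCCEEDED"]))  # succeeded
```

In the failure reported here, the scheduler is polling for job 02 while the job on disk identifies itself as 01, so even this fallback cannot recover the result.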

hjoliver avatar Apr 22 '25 01:04 hjoliver

The root of the problem is that something went horribly wrong with your initial job submission.

The job submission command should return immediately with the PBS job ID, if PBS accepts the job, but apparently it never returned and was killed on the default 10 min process pool timeout.

Given that, the following response by Cylc is correct:

  • conclude that job 01 submission failed, so tee up a configured retry (job 02)
  • job 02 submission succeeded, so carry on with that
  • ignore any later message from job 01, because it (evidently) failed to submit and so we have no PBS job ID to conclusively identify it

Then, job 02 should complete and allow the workflow to continue. I'm struggling to understand what could stop that from happening. Maybe a network outage blocked the job 02 success message, but if so job polling by Cylc should (in time) recover the correct result.

hjoliver avatar Apr 22 '25 02:04 hjoliver

Ok, so I'm the original reporter of this at our organisation. From what we could see, the job 02 was actually submitted from the 01 folder. We looked at the job for the PBS ID linked to 02, and it was saying it was 01. So there is an increment issue somewhere. I don't know how Cylc determines the directory and job number to use, but in this case, 01 was submitted twice.

ColemanTom avatar Apr 22 '25 03:04 ColemanTom

@MetRonnie - the job-submit is timing out for other (non-Cylc) reasons, not really relevant to this bug report but obviously not ideal.

@hjoliver We continue to see this problem very intermittently over the weekend, where Cylc is confused with job numbering (for a second job) between scheduler and remote after an initial submit fail.
In the example below, the first submit did not create a PBS job, which differs from what @jarich posted above.

Here is some hopefully de-sensitised log output:

30-restart-03.log:2025-04-20T07:09:22Z INFO - [20250420T0000Z/glu_check_output:waiting(runahead)] => waiting
30-restart-03.log:2025-04-20T07:09:22Z INFO - [20250420T0000Z/glu_check_output:waiting] => waiting(queued)
30-restart-03.log:2025-04-20T07:09:23Z INFO - [20250420T0000Z/glu_check_output:waiting(queued)] => waiting
30-restart-03.log:2025-04-20T07:09:23Z INFO - [20250420T0000Z/glu_check_output:waiting] => preparing
30-restart-03.log:2025-04-20T07:19:24Z ERROR - [jobs-submit cmd] cat /home/xxx/access_g4/log/job/20250420T0000Z/glu_check_output/01/job | ssh -oBatchMode=yes -oConnectTimeout=10 x.x.bom.gov.au env CYLC_VERSION=8.3.6 CYLC_ENV_NAME=cylc-8.3.6 bash --login -c ''"'"'exec "$0" "$@"'"'"'' x/v2.2.0/global_model/cylc-bin/cylc jobs-submit --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/access_g4/log/job' 20250420T0000Z/glu_check_output/01
30-restart-03.log:2025-04-20T07:19:24Z ERROR - [jobs-submit cmd] ssh -oBatchMode=yes -oConnectTimeout=10 x.x.bom.gov.au env CYLC_VERSION=8.3.6 CYLC_ENV_NAME=cylc-8.3.6 bash --login -c ''"'"'exec "$0" "$@"'"'"'' x/global_model/cylc-bin/cylc jobs-submit --utc-mode --remote-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/access_g4/log/job' 20250420T0000Z/glu_check_output/01
30-restart-03.log:    [jobs-submit out] 2025-04-20T07:19:24Z|20250420T0000Z/glu_check_output/01|1
30-restart-03.log:2025-04-20T07:19:24Z CRITICAL - [20250420T0000Z/glu_check_output/01:preparing] submission failed
30-restart-03.log:2025-04-20T07:19:24Z INFO - [20250420T0000Z/glu_check_output/01:preparing] => waiting
30-restart-03.log:2025-04-20T07:19:24Z WARNING - [20250420T0000Z/glu_check_output:waiting] retrying in PT1M (after 2025-04-20T07:20:24Z)
30-restart-03.log:2025-04-20T07:20:24Z INFO - xtrigger satisfied: _cylc_submit_retry_20250420T0000Z/glu_check_output = wall_clock(trigger_time=1745133624.171367)
30-restart-03.log:2025-04-20T07:20:24Z INFO - [20250420T0000Z/glu_check_output:waiting] => waiting(queued)
30-restart-03.log:2025-04-20T07:20:24Z INFO - [20250420T0000Z/glu_check_output:waiting(queued)] => waiting
30-restart-03.log:2025-04-20T07:20:24Z INFO - [20250420T0000Z/glu_check_output:waiting] => preparing
30-restart-03.log:2025-04-20T07:20:26Z INFO - [20250420T0000Z/glu_check_output/02:preparing] submitted to <platform>:pbs[8458918]
30-restart-03.log:2025-04-20T07:20:26Z INFO - [20250420T0000Z/glu_check_output/02:preparing] => submitted
30-restart-03.log:2025-04-20T07:20:56Z INFO - [20250420T0000Z/glu_check_output/02:submitted] => running
30-restart-03.log:2025-04-20T07:22:26Z WARNING - [20250420T0000Z/glu_check_output/02:running] (received-ignored)started for job(01) != job(02)
30-restart-03.log:2025-04-20T07:27:03Z WARNING - [20250420T0000Z/glu_check_output/02:running] (received-ignored)succeeded for job(01) != job(02)
30-restart-03.log:2025-04-20T07:30:56Z INFO - [20250420T0000Z/glu_check_output/02:running] poll now, (next in PT1M (after 2025-04-20T07:31:56Z))
30-restart-03.log:2025-04-20T07:31:56Z INFO - [20250420T0000Z/glu_check_output/02:running] poll now, (next in PT2M (after 2025-04-20T07:33:56Z))
30-restart-03.log:2025-04-20T07:33:57Z INFO - [20250420T0000Z/glu_check_output/02:running] poll now, (next in PT7M (after 2025-04-20T07:40:57Z))
30-restart-03.log:2025-04-20T07:40:57Z INFO - [20250420T0000Z/glu_check_output/02:running] poll now, (next in PT7M (after 2025-04-20T07:47:57Z))`

I checked on-disk and there was no 02 log directory, only 01, which matches the (received-ignored)started for job(01) != job(02) messages above:

[me@sc 20250420T0600Z]$ cd glu_check_output/
[me@sc glu_check_output]$ ll
total 4
drwxr-xr-x 2 <user> <group> 4096 Apr 20 12:28 01
lrwxrwxrwx 1 <user> <group>    2 Apr 20 12:28 NN -> 01

So somehow Cylc is resubmitting the job as 01 even though it believes it is submitting 02. This results in the task never completing, success or fail, and eventual workflow stall.

In the GUI, you get a long list of Task message: poll failed entries, and if a kill is attempted, similar:

[GUI screenshot showing the failed poll/kill messages]

DamianAgius avatar Apr 22 '25 04:04 DamianAgius

OK that seems like useful info, thanks. If the 02 job somehow gets out in the wild labelled as 01, that would certainly prevent the graph branch from continuing, because the 01 job has been deliberately orphaned and any subsequent messages from it will be ignored.

hjoliver avatar Apr 22 '25 04:04 hjoliver

I haven't managed to reproduce this, despite causing job submission failure in various ways - job 02 is working as intended in my environment ... still looking.

hjoliver avatar Apr 22 '25 05:04 hjoliver

Can you try to reproduce this in your environment by deliberately causing job submission failure in a trivial test workflow? You could e.g. use a bad PBS directive, or copy the platform config and mess with the job runner command template e.g. to xsub %(job)s.

(That will cause job 02 to submit-fail as well, but that's enough to generate the 02 job log directory).
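For anyone trying this, a deliberately broken platform along the lines suggested might look like the following (the platform name and the nonexistent xsub command are placeholders; job runner command template is the setting being pointed at):

```ini
# global.cylc -- a deliberately broken copy of a PBS platform, as
# suggested above; "broken_pbs" is a placeholder name.
[platforms]
    [[broken_pbs]]
        job runner = pbs
        # "xsub" does not exist, so every submission will fail:
        job runner command template = xsub %(job)s
```

Pointing a trivial test task at this platform should produce a submission failure and (on retry) a 02 job log directory, which is the behaviour to compare against.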

hjoliver avatar Apr 22 '25 05:04 hjoliver

Can you try to reproduce this in your environment by deliberately causing job submission failure in a trivial test workflow?

I can't; maybe the others can. But I wanted to note that in the first case mentioned, 01 did submit: it took 5 minutes for the task to appear in PBS, and then the communication back to the Cylc servers must have taken more than 5 minutes. So it may not always be that submission failed, but rather that the communication back is held up and a resubmit is attempted while the original submission is actually stuck and still processing.

ColemanTom avatar Apr 22 '25 05:04 ColemanTom

That still counts as job submission failure to Cylc though - because the long delay ran into the process pool timeout (10 mins) so the still-running submission process got killed (god knows why it was still running, but apparently it was!). Even if that happens, job 02 should submit with the intended "02" identity.
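For reference, the 10-minute limit being discussed is the scheduler's process pool timeout, which is configurable in global.cylc. Raising it is shown here only as an illustration for sites where qsub is known to stall under load, not as a fix for the numbering bug:

```ini
# global.cylc -- the process pool timeout being hit here is PT10M by
# default; a site with a slow batch system could raise it, e.g.:
[scheduler]
    process pool timeout = PT20M
```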

hjoliver avatar Apr 22 '25 05:04 hjoliver

I've done some investigation on our machine by manipulating when tasks start, but I've not been able to reproduce the issue. The original timeout will be from qsub stalling (due to server load or something) past Cylc's process pool timeout, but still submitting the actual task.

The polling errors I don't understand. I think we're going to need to wait for the error to recur to be able to properly diagnose it; the logs from the original events are no longer accessible to me.

I think the resubmitted '01' job may be a red herring - in Damian's example that's looking at the directory for a different cycle.

ScottWales avatar May 02 '25 03:05 ScottWales

I think the resubmitted '01' job may be a red herring - in Damian's example that's looking at the directory for a different cycle.

I disagree with this statement. My example (posted by Jacinta at the top) clearly shows that the second submission, as captured by the PBS job ID mappings against what the server thought was 02, was submitted from the 01 folder. The job log itself also had 01 in the top part. The server thought 02 was submitted, but the HPC side submitted it from 01, using the 01 job script.

ColemanTom avatar May 02 '25 03:05 ColemanTom

I think that in your example 6553013 was the submission for job 01, so it's using the right script. 6552935 is definitely job 02, unfortunately we don't have PBS info on that one. I believe the PBS ids are in the wrong order due to qsub weirdness around the original timeout.

6552553 is an unrelated task (createbc rather than frame).

ScottWales avatar May 02 '25 03:05 ScottWales

I think that in your example 6553013 was the submission for job 01, so it's using the right script. 6552935 is definitely job 02, unfortunately we don't have PBS info on that one. I believe the PBS ids are in the wrong order due to qsub weirdness around the original timeout.

No 02 directory was made. Both jobs had the same submit script. While that information is not in this ticket, qstat -f output was compared.

ColemanTom avatar May 02 '25 03:05 ColemanTom