cylc-flow
job polling broken for failed jobs after restart
TL;DR:
Failed tasks can be polled back to incorrect states on restart.
Bug:
After a restart, Cylc updates task proxies with the owner@host pair of submitted/running jobs to allow polling:
https://github.com/cylc/cylc-flow/blob/5ef44194a3e28f21944ae894c52f0c523d590f94/lib/cylc/task_pool.py#L361-L369
This, however, excludes succeeded and failed tasks. Consequently, following a restart, remote tasks in those states do not have their owner@host loaded from the DB, which causes polling to run locally.
Polling will most likely fail but could also produce unexpected results (particularly for the case of background jobs).
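To make the mechanism concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not the code in task_pool.py: the TaskProxy class, the restore_from_db and poll_target helpers, and the host name remote-hpc are all hypothetical names made up for illustration.

```python
# Simplified, hypothetical model of the restart behaviour described above.
# None of these names are taken from the Cylc source.
from dataclasses import dataclass
from typing import Optional

ACTIVE_STATES = {"submitted", "running"}


@dataclass
class TaskProxy:
    name: str
    status: str
    owner: Optional[str] = None
    host: Optional[str] = None  # None is treated as "poll locally"


def restore_from_db(task: TaskProxy, user_at_host: str) -> None:
    """Restore owner@host, but only for active tasks (the reported behaviour)."""
    if task.status in ACTIVE_STATES:
        owner, _, host = user_at_host.rpartition("@")
        task.owner = owner or None
        task.host = host


def poll_target(task: TaskProxy) -> str:
    """Return the host a poll would run on for this task."""
    return task.host or "localhost"


running = TaskProxy("restart.1", "running")
failed = TaskProxy("a.1", "failed")
for task in (running, failed):
    restore_from_db(task, "user@remote-hpc")

print(poll_target(running))  # host restored from the DB
print(poll_target(failed))   # host never restored, so the poll falls back to local
```

In this model the failed task keeps no record of where its job ran, so any later poll targets the local host even though the job ran remotely.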
This may be related to #1792, which extended polling to succeeded/failed tasks but didn't extend the owner@host update logic:
https://github.com/cylc/cylc-flow/pull/2396/files#diff-1f1aa9b850f9d1655a22322beb0e2d0604fb816b3bc807210120547f1a35ae24
When this is combined with a task that fails by hitting its execution time limit on a remote batch system (one that cannot be polled locally), the task is polled back to running.
Reproducible Example:
[scheduling]
    [[dependencies]]
        graph = """
            a
            a:fail => restart
        """
[runtime]
    [[a]]
        script = """
            sleep 60
        """
        [[[remote]]]
            host = <host>
        [[[job]]]
            execution time limit = PT1S
            batch system = pbs
    [[restart]]
        script = """
            cylc stop "${CYLC_SUITE_NAME}" --now --now
            sleep 5
            cylc restart "${CYLC_SUITE_NAME}" --host=localhost
        """
Log Snippet (post-restart):
LOADING task proxies
+ a.1 failed
+ restart.1 running
LOADING task action timers
+ a.1 [[u'job-logs-retrieve', u'failed'], 1]
+ a.1 [u'try_timers', u'retrying']
+ a.1 [u'try_timers', u'submit-retrying']
+ restart.1 poll_timer
+ restart.1 [u'try_timers', u'retrying']
+ restart.1 [u'try_timers', u'submit-retrying']
[a.1] status=failed: (polled)succeeded at 2021-11-16T10:15:19Z for job(01) <= ERROR
[restart.1] status=running: (polled)succeeded at 2021-11-16T10:17:12Z for job(01)
Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).
I can't test this with Cylc 8 at the moment; however, I expect the bug is present there too.
The solution is presumably to update the owner@host for succeeded and failed tasks as well. We will need to check the logic to ensure this doesn't produce any unexpected side effects in other parts of the code, e.g. host selection.
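As a rough sketch of what that change could look like, the snippet below widens the set of states whose owner@host is restored on restart. It is only an illustration under stated assumptions: POLLABLE_STATES, split_user_at_host and restore_job_location are hypothetical names, not functions from cylc-flow, and side effects on host selection are not modelled.

```python
from typing import Optional, Tuple

# Hypothetical state set: every state that may still be polled after a restart.
POLLABLE_STATES = frozenset({"submitted", "running", "succeeded", "failed"})


def split_user_at_host(
    user_at_host: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Split 'owner@host' into (owner, host); a bare 'host' gives (None, host)."""
    if not user_at_host:
        return None, None
    owner, _, host = user_at_host.rpartition("@")
    return (owner or None), host


def restore_job_location(
    status: str, user_at_host: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (owner, host) to restore for a task loaded from the DB."""
    if status not in POLLABLE_STATES:
        # Tasks without a job to poll keep no job location.
        return None, None
    return split_user_at_host(user_at_host)


print(restore_job_location("failed", "user@remote-hpc"))
print(restore_job_location("waiting", "user@remote-hpc"))
```

Keeping the state check in one place makes it easier to confirm that the polling code and the restart code agree on which tasks are pollable.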
Cylc 8 issue - https://github.com/cylc/cylc-flow/issues/4513
#4513 is really a different issue (polling doing the wrong thing). This issue is about polling happening on the wrong platform.
I've confirmed that this remains an issue at Cylc 8. I've had 2 recent reports of this problem so I think we really need to get it fixed in both Cylc 7 & 8.
Closed by #5016