Handle restart with changed sequence better
Description
Restarting a workflow with a waiting or orphaned task that is no longer on sequence (due to `flow.cylc` having been changed) can result in a traceback.
Reproducible Example
Run this workflow:
```ini
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = foo[-P1] => foo
[runtime]
    [[foo]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT == 2 ]]; then
                cylc stop "$CYLC_WORKFLOW_ID" --now;
            fi
        """
```
When it shuts down, edit `flow.cylc`:
```diff
     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo
 [runtime]
```
Then restart the workflow to get:
```console
$ cylc vr <id>
$ cylc cat-log <id>
ERROR - list index out of range
Traceback (most recent call last):
  File "~/github/cylc-flow/cylc/flow/scheduler.py", line 742, in start
    await self.configure(params)
  File "~/github/cylc-flow/cylc/flow/scheduler.py", line 490, in configure
    self._load_pool_from_db()
  File "~/github/cylc-flow/cylc/flow/scheduler.py", line 792, in _load_pool_from_db
    self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
  File "~/github/cylc-flow/cylc/flow/rundb.py", line 932, in select_task_pool_for_restart
    platform_error = callback(row_idx, list(row))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/github/cylc-flow/cylc/flow/task_pool.py", line 597, in load_db_task_pool_for_restart
    self.compute_runahead()
  File "~/github/cylc-flow/cylc/flow/task_pool.py", line 400, in compute_runahead
    limit_point = sorted(sequence_points)[:ilimit + 1][-1]
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range
CRITICAL - Workflow shutting down - list index out of range
```
Expected Behaviour
No traceback; more helpful error message.
Need to handle `sequence_points` being empty here: https://github.com/cylc/cylc-flow/blob/3962f3d4df171f7ffde4f233e8859abedb3ad451/cylc/flow/task_pool.py#L400
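For context, the failing expression at that line boils down to indexing into an empty list (a standalone illustration, not cylc code; the values are made up):

```python
ilimit = 3            # runahead cycle limit (illustrative value)
sequence_points = []  # empty: nothing is on-sequence after the graph change

# sorted([]) is [], slicing [] gives [], and [][-1] raises IndexError
limit_point = sorted(sequence_points)[:ilimit + 1][-1]
```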
@MetRonnie, I've not been able to reproduce this one; I've tried 8.3.x, 8.3.0 and 8.2.3.
The example might need a tweak.
This PR might help: https://github.com/cylc/cylc-flow/pull/6229
I can still reproduce on 8.3.2 and 8.3.x as of dc9f01b72ecb022e12d024aaf7c0c0a32a5c6625, by copying the example above.
Ach, blindingly obvious, I didn't reinstall! #6229 does not appear to fix this :(
This is very similar to #6229: the `compute_runahead` algorithm is (correctly) erroring due to a complete lack of tasks.
- In the case of #6229, this occurs when starting a workflow for the first time.
- In this case the error is introduced later by restart/reload.
This latter case should be extremely rare, and it's an error case in the first place, so it's fairly low priority.
Easy fix, however!
With this diff the workflow will shut down gracefully on restart/reload:
```diff
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..c992e880b 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
         self._prev_runahead_sequence_points = sequence_points
         self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            return False
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]
```
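(If I've read it right, `compute_runahead` returns a bool indicating whether the runahead limit changed, so returning `False` here should be a no-op as far as callers are concerned; that's an assumption worth double-checking.)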
However, a reload that essentially empties the task pool is almost certainly a mistake, so going ahead with the restart/reload and wiping out the pool (making it difficult to recover the workflow) is not necessarily the best move.
We could consider raising an error here (examples of this form are the only confirmed way to hit this bug). This would cause the restart/reload to fail with an informative message:
```diff
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..647353378 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
         self._prev_runahead_sequence_points = sequence_points
         self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            raise WorkflowConfigError('No tasks scheduled to run')
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]
```
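(Note: this assumes `WorkflowConfigError` is in scope in `task_pool.py`; it lives in `cylc.flow.exceptions`, so an import may be needed.)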
Which will it be?
1. Do what the user asked; log something helpful.
2. Prevent the operation from being actioned; raise a helpful error.
Option 2 is no worse than the current behaviour, so it could at least be implemented as a quick fix for the issue.
Edit: however, with the option 2 patch, I've just seen that reloading a workflow after changing
```diff
     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo
```
causes the workflow to get stuck in the reloading state after the `WorkflowConfigError` is raised. (This is pretty much identical to the current behaviour.)
Errors raised during reload should cause the reload to be aborted. The error should be logged and the workflow should continue with the original config.
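Something like this abort-and-revert shape is what I'd expect (a rough sketch only; `reload_config`, `apply_config`, `status` and the scheduler attributes here are illustrative names, not the actual cylc-flow interfaces):

```python
from cylc.flow import LOG
from cylc.flow.exceptions import WorkflowConfigError


def reload_config(scheduler, new_config):
    """Apply new_config, reverting to the old config on error.

    Sketch only: `scheduler`, `apply_config` and `status` are
    illustrative names, not the real cylc-flow API.
    """
    old_config = scheduler.config
    try:
        scheduler.apply_config(new_config)
        # recompute the runahead limit; with the option 2 patch this raises
        # WorkflowConfigError if the new graph leaves nothing to schedule
        scheduler.pool.compute_runahead()
    except WorkflowConfigError as exc:
        LOG.error(f"Reload aborted: {exc}")
        scheduler.apply_config(old_config)  # carry on with the original config
    finally:
        scheduler.status = 'running'  # don't leave the status stuck as "reloading"
```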
If that's not happening, we should fix it. I think the reload aborts correctly on master (due to the error in `compute_runahead`).
It aborts; however, the workflow status gets stuck as "reloading".