cylc-flow
cylc-flow copied to clipboard
Making reload safer
re https://cylc.discourse.group/t/task-wrangling-issues/517/8 We need to make it obvious when reloads fail.
I think reload should:
- pause the workflow (and possibly wait for any tasks in the preparing state to submit?)
- reload the config (log what has changed?)
- if the reload succeeds, release the workflow (unless the workflow was previously held)
- if the reload fails, leave the workflow paused so the user has the choice to resume with the existing workflow definition or fix the problem and reload again
Also, what about restarts? If a server dies and you restart a workflow it's meant to continue unchanged. However, what happens if you've done a reinstall but not reloaded (and perhaps even have a workflow config which does not validate)? I think restart should always use the previous config. If this differs from the installed config this should be logged (if possible) but not loaded. We could add a --reload option to cylc play to support the old behaviour.
and possibly wait for any tasks in the preparing state to submit?
We would need to flush through all tasks in the preparing state (otherwise we can't say for sure whether the job file was written with the old or new settings) whilst the workflow is paused (otherwise tasks will continue to enter the preparing state).
One downside of this is that a remote operation (e.g. remote-init) running during the reload process would cause the workflow to become unresponsive for that period. Without major changes there would be no way to "cancel" a reload once actioned.
I think restart should always use the previous config.
It should also not repeat any remote file installation.
With the current approach to reload it's not especially clear what is and isn't supposed to change as a result of a reload.
I went through the code and pulled out the attributes which are and aren't reloaded:
Task Proxy Attributes
Reloaded (not explicitly preserved):
- clock_trigger_time
- expire_time
- graph_children
- is_late
- late_time
- non_unique_events
- reload_successor
- tdef
- waiting_on_job_prep
Preserved:
- flow_wait
- identity (implicitly preserved as TaskDef cannot change name, point is preserved)
- is_manual_submit
- job_vacated
- jobs
- local_job_file_path
- platform (Note: task_events_mgr looks at the task def not the
.platform
attribute for this) - poll_timer
- state
- submit_num
- summary
- timeout
- tokens (implicitly preserved as identity is preserved)
- try_timers
- flow_nums (set in Task Pool)
- point (set in Task Pool)
- point_as_seconds (set in Task Pool)
Task State Attributes
Reloaded (not explicitly preserved):
- _is_satisfied
- _suicide_is_satisfied
- external_triggers
- is_queued
- kill_failed
- suicide_prerequisites
- time_updated
- xtriggers (
_cylc
xtriggers are preserved, others are permitted to change)
Preserved:
- is_held
- is_runahead
- is_updated
- outputs
- prerequisites
- status (set in Task Pool)
Main takeaway is that prereqs and outputs can't be changed by a reload. Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?
Well done, good to have that info laid out in full :+1:
Main takeaway is that prereqs and outputs can't be changed by a reload.
Worth pointing out, for anyone reading this, we're only talking about reloading existing task proxies here - everything can be changed by reload if the task hasn't been spawned yet.
Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?
Yeah, we should be able to do that.
What should be preserved:
- any attribute that relates to the history of this particular task proxy (jobs, submit number etc.)
What should be reloaded:
- everything else
Any exceptions to that?
Just realized I didn't comment on @dpmatthews suggestions at the top.
I like the reload suggestions :+1: And I particularly like the restart idea: always restart with the pre-restart config, so the user always has to reload an updated flow.cylc. (However, this will make it even more important to get reload right, as restart won't be a fall back any more.)
For the record here's a portion of the diff I was working on
for itask in tasks:
if itask.tdef.name in self.orphans:
if (
- itask.state(TASK_STATUS_WAITING)
- or itask.state.is_held
- or itask.state.is_queued
+ itask.state(TASK_STATUS_WAITING)
+ or itask.state.is_held
+ or itask.state.is_queued
):
# Remove orphaned task if it hasn't started running yet.
self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
"- task definition removed"
)
else:
- new_task = TaskProxy(
- self.config.get_taskdef(itask.tdef.name),
- itask.point, itask.flow_nums, itask.state.status)
- itask.copy_to_reload_successor(new_task)
- self._swap_out(new_task)
- LOG.info(f"[{itask}] reloaded task definition")
- if itask.state(*TASK_STATUSES_ACTIVE):
+ if (
+ itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+ or itask.waiting_on_job_prep # pre preparation
+ ):
LOG.warning(
f"[{itask}] active with pre-reload settings"
)
- elif itask.state(TASK_STATUS_PREPARING):
- # Job file might have been written at this point?
- LOG.warning(
- f"[{itask}] may be active with pre-reload settings"
- )
+ else:
+ itask.reload(self.config.get_taskdef(itask.tdef.name))
+ LOG.info(f"[{itask}] reloaded task definition")
+ # new_task = TaskProxy(
+ # self.config.get_taskdef(itask.tdef.name),
+ # itask.point, itask.flow_nums, itask.state.status)
+ # itask.copy_to_reload_successor(new_task)
+ # self._swap_out(new_task)
+ # LOG.info(f"[{itask}] reloaded task definition")
+ # if itask.state(*TASK_STATUSES_ACTIVE):
+ # LOG.warning(
+ # f"[{itask}] active with pre-reload settings"
+ # )
+ # elif itask.state(TASK_STATUS_PREPARING):
+ # # Job file might have been written at this point?
+ # LOG.warning(
+ # f"[{itask}] may be active with pre-reload settings"
+ # )
diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
)
+ def reload(self, tdef):
+ # swap out the task definition
+ self.tdef = tdef
+
+ # reload the task state (incl. triggers & outputs)
+ # self.state.reload(tdef, self.point)
+
+ # recompute derived values
+ self.get_late_time()
+ self.graph_children = generate_graph_children(tdef, self.point)
+
+ self.state.is_queued = False
+ self.state.is_updated = True
for itask in tasks:
if itask.tdef.name in self.orphans:
if (
- itask.state(TASK_STATUS_WAITING)
- or itask.state.is_held
- or itask.state.is_queued
+ itask.state(TASK_STATUS_WAITING)
+ or itask.state.is_held
+ or itask.state.is_queued
):
# Remove orphaned task if it hasn't started running yet.
self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
"- task definition removed"
)
else:
- new_task = TaskProxy(
- self.config.get_taskdef(itask.tdef.name),
- itask.point, itask.flow_nums, itask.state.status)
- itask.copy_to_reload_successor(new_task)
- self._swap_out(new_task)
- LOG.info(f"[{itask}] reloaded task definition")
- if itask.state(*TASK_STATUSES_ACTIVE):
+ if (
+ itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+ or itask.waiting_on_job_prep # pre preparation
+ ):
LOG.warning(
f"[{itask}] active with pre-reload settings"
)
- elif itask.state(TASK_STATUS_PREPARING):
- # Job file might have been written at this point?
- LOG.warning(
- f"[{itask}] may be active with pre-reload settings"
- )
+ else:
+ itask.reload(self.config.get_taskdef(itask.tdef.name))
+ LOG.info(f"[{itask}] reloaded task definition")
+ # new_task = TaskProxy(
+ # self.config.get_taskdef(itask.tdef.name),
+ # itask.point, itask.flow_nums, itask.state.status)
+ # itask.copy_to_reload_successor(new_task)
+ # self._swap_out(new_task)
+ # LOG.info(f"[{itask}] reloaded task definition")
+ # if itask.state(*TASK_STATUSES_ACTIVE):
+ # LOG.warning(
+ # f"[{itask}] active with pre-reload settings"
+ # )
+ # elif itask.state(TASK_STATUS_PREPARING):
+ # # Job file might have been written at this point?
+ # LOG.warning(
+ # f"[{itask}] may be active with pre-reload settings"
+ # )
diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
)
+ def reload(self, tdef):
+ # swap out the task definition
+ self.tdef = tdef
+
+ # recompute derived values
+ self.get_late_time()
+ self.graph_children = generate_graph_children(tdef, self.point)
+
+ self.state.is_queued = False
+ self.state.is_updated = True
Fairly straight forward, this change should simplify things a fair bit. The remaining difficulty is the reloading of outputs / prerequisites which turned out to be too complicated to hash out quickly for 8.0.x (too complicated and too risky too).
See also #5579, we should probably wait for preparing tasks to clear before pausing the workflow.
I'm taking a look into flushing out preparing tasks and pausing the workflow pre-reload as a means to fixing #5579. If successful I will punt the remainder of this issue (to do with the interaction of restart and reload) to another issue for later consideration.
Bumped the cylc play --reload
stuff to https://github.com/cylc/cylc-flow/issues/5591