cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Making reload safer

Open dpmatthews opened this issue 2 years ago • 7 comments

re https://cylc.discourse.group/t/task-wrangling-issues/517/8 We need to make it obvious when reloads fail.

I think reload should:

  • pause the workflow (and possibly wait for any tasks in the preparing state to submit?)
  • reload the config (log what has changed?)
  • if the reload succeeds, release the workflow (unless the workflow was previously held)
  • if the reload fails, leave the workflow paused so the user has the choice to resume with the existing workflow definition or fix the problem and reload again

Also, what about restarts? If a server dies and you restart a workflow it's meant to continue unchanged. However, what happens if you've done a reinstall but not reloaded (and perhaps even have a workflow config which does not validate)? I think restart should always use the previous config. If this differs from the installed config this should be logged (if possible) but not loaded. We could add a --reload option to cylc play to support the old behaviour.

dpmatthews avatar Aug 26 '22 15:08 dpmatthews

and possibly wait for any tasks in the preparing state to submit?

We would need to flush through all tasks in the preparing state (otherwise we can't say for sure whether the job file was written with the old or new settings) whilst the workflow is paused (otherwise tasks will continue to enter the preparing state).

One downside of this is that a remote operation (e.g. remote-init) running during the reload process would cause the workflow to become unresponsive for that period. Without major changes there would be no way to "cancel" a reload once actioned.

oliver-sanders avatar Aug 31 '22 09:08 oliver-sanders

I think restart should always use the previous config.

It should also not repeat any remote file installation.

dpmatthews avatar Aug 31 '22 16:08 dpmatthews

With the current approach to reload it's not especially clear what is and isn't supposed to change as a result of a reload.

I went through the code and pulled out the attributes which are and aren't reloaded:

Task Proxy Attributes

Reloaded (not explicitly preserved):

  • clock_trigger_time
  • expire_time
  • graph_children
  • is_late
  • late_time
  • non_unique_events
  • reload_successor
  • tdef
  • waiting_on_job_prep

Preserved:

  • flow_wait
  • identity (implicitly preserved as TaskDef cannot change name, point is preserved)
  • is_manual_submit
  • job_vacated
  • jobs
  • local_job_file_path
  • platform (Note: task_events_mgr looks at the task def not the .platform attribute for this)
  • poll_timer
  • state
  • submit_num
  • summary
  • timeout
  • tokens (implicitly preserved as identity is preserved)
  • try_timers
  • flow_nums (set in Task Pool)
  • point (set in Task Pool)
  • point_as_seconds (set in Task Pool)

Task State Attributes

Reloaded (not explicitly preserved):

  • _is_satisfied
  • _suicide_is_satisfied
  • external_triggers
  • is_queued
  • kill_failed
  • suicide_prerequisites
  • time_updated
  • xtriggers (_cylc xtriggers are preserved, others are permitted to change)

Preserved:

  • is_held
  • is_runahead
  • is_updated
  • outputs
  • prerequisites
  • status (set in Task Pool)

Main takeaway is that prereqs and outputs can't be changed by a reload. Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?

oliver-sanders avatar Sep 12 '22 12:09 oliver-sanders

Well done, good to have that info laid out in full :+1:

Main takeaway is that prereqs and outputs can't be changed by a reload.

Worth pointing out, for anyone reading this, we're only talking about reloading existing task proxies here - everything can be changed by reload if the task hasn't been spawned yet.

Since SoD this should probably be possible now. It makes sense to add or remove a prerequisite of a waiting task or add an output to a failed one?

Yeah, we should be able to do that.

hjoliver avatar Sep 15 '22 04:09 hjoliver

What should be preserved:

  • any attribute that relates to the history of this particular task proxy (jobs, submit number etc.)

What should be reloaded:

  • everything else

Any exceptions to that?

hjoliver avatar Sep 15 '22 05:09 hjoliver

Just realized I didn't comment on @dpmatthews suggestions at the top.

I like the reload suggestions :+1: And I particularly like the restart idea: always restart with the pre-restart config, so the user always has to reload an updated flow.cylc. (However, this will make it even more important to get reload right, as restart won't be a fall back any more.)

hjoliver avatar Sep 15 '22 07:09 hjoliver

For the record here's a portion of the diff I was working on
         for itask in tasks:
             if itask.tdef.name in self.orphans:
                 if (
-                        itask.state(TASK_STATUS_WAITING)
-                        or itask.state.is_held
-                        or itask.state.is_queued
+                    itask.state(TASK_STATUS_WAITING)
+                    or itask.state.is_held
+                    or itask.state.is_queued
                 ):
                     # Remove orphaned task if it hasn't started running yet.
                     self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
                         "- task definition removed"
                     )
             else:
-                new_task = TaskProxy(
-                    self.config.get_taskdef(itask.tdef.name),
-                    itask.point, itask.flow_nums, itask.state.status)
-                itask.copy_to_reload_successor(new_task)
-                self._swap_out(new_task)
-                LOG.info(f"[{itask}] reloaded task definition")
-                if itask.state(*TASK_STATUSES_ACTIVE):
+                if (
+                    itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+                    or itask.waiting_on_job_prep  # pre preparation
+                ):
                     LOG.warning(
                         f"[{itask}] active with pre-reload settings"
                     )
-                elif itask.state(TASK_STATUS_PREPARING):
-                    # Job file might have been written at this point?
-                    LOG.warning(
-                        f"[{itask}] may be active with pre-reload settings"
-                    )
+                else:
+                    itask.reload(self.config.get_taskdef(itask.tdef.name))
+                    LOG.info(f"[{itask}] reloaded task definition")
+                # new_task = TaskProxy(
+                #     self.config.get_taskdef(itask.tdef.name),
+                #     itask.point, itask.flow_nums, itask.state.status)
+                # itask.copy_to_reload_successor(new_task)
+                # self._swap_out(new_task)
+                # LOG.info(f"[{itask}] reloaded task definition")
+                # if itask.state(*TASK_STATUSES_ACTIVE):
+                #     LOG.warning(
+                #         f"[{itask}] active with pre-reload settings"
+                #     )
+                # elif itask.state(TASK_STATUS_PREPARING):
+                #     # Job file might have been written at this point?
+                #     LOG.warning(
+                #         f"[{itask}] may be active with pre-reload settings"
+                #     )

diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
             f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
         )
 
+    def reload(self, tdef):
+        # swap out the task definition
+        self.tdef = tdef
+
+        # reload the task state (incl. triggers & outputs)
+        # self.state.reload(tdef, self.point)
+
+        # recompute derived values
+        self.get_late_time()
+        self.graph_children = generate_graph_children(tdef, self.point)
+
+        self.state.is_queued = False
+        self.state.is_updated = True
         for itask in tasks:
             if itask.tdef.name in self.orphans:
                 if (
-                        itask.state(TASK_STATUS_WAITING)
-                        or itask.state.is_held
-                        or itask.state.is_queued
+                    itask.state(TASK_STATUS_WAITING)
+                    or itask.state.is_held
+                    or itask.state.is_queued
                 ):
                     # Remove orphaned task if it hasn't started running yet.
                     self.remove(itask, 'task definition removed')
@@ -881,21 +881,31 @@ class TaskPool:
                         "- task definition removed"
                     )
             else:
-                new_task = TaskProxy(
-                    self.config.get_taskdef(itask.tdef.name),
-                    itask.point, itask.flow_nums, itask.state.status)
-                itask.copy_to_reload_successor(new_task)
-                self._swap_out(new_task)
-                LOG.info(f"[{itask}] reloaded task definition")
-                if itask.state(*TASK_STATUSES_ACTIVE):
+                if (
+                    itask.state(TASK_STATUS_PREPARING, *TASK_STATUSES_ACTIVE)
+                    or itask.waiting_on_job_prep  # pre preparation
+                ):
                     LOG.warning(
                         f"[{itask}] active with pre-reload settings"
                     )
-                elif itask.state(TASK_STATUS_PREPARING):
-                    # Job file might have been written at this point?
-                    LOG.warning(
-                        f"[{itask}] may be active with pre-reload settings"
-                    )
+                else:
+                    itask.reload(self.config.get_taskdef(itask.tdef.name))
+                    LOG.info(f"[{itask}] reloaded task definition")
+                # new_task = TaskProxy(
+                #     self.config.get_taskdef(itask.tdef.name),
+                #     itask.point, itask.flow_nums, itask.state.status)
+                # itask.copy_to_reload_successor(new_task)
+                # self._swap_out(new_task)
+                # LOG.info(f"[{itask}] reloaded task definition")
+                # if itask.state(*TASK_STATUSES_ACTIVE):
+                #     LOG.warning(
+                #         f"[{itask}] active with pre-reload settings"
+                #     )
+                # elif itask.state(TASK_STATUS_PREPARING):
+                #     # Job file might have been written at this point?
+                #     LOG.warning(
+                #         f"[{itask}] may be active with pre-reload settings"
+                #     )


diff --git a/cylc/flow/task_proxy.py b/cylc/flow/task_proxy.py
index b5511ffb1..7216993a4 100644
--- a/cylc/flow/task_proxy.py
+++ b/cylc/flow/task_proxy.py
@@ -266,6 +266,20 @@ class TaskProxy:
             f" flows:{','.join(str(i) for i in self.flow_nums) or 'none'}"
         )
 
+    def reload(self, tdef):
+        # swap out the task definition
+        self.tdef = tdef
+
+        # recompute derived values
+        self.get_late_time()
+        self.graph_children = generate_graph_children(tdef, self.point)
+
+        self.state.is_queued = False
+        self.state.is_updated = True

Fairly straight forward, this change should simplify things a fair bit. The remaining difficulty is the reloading of outputs / prerequisites which turned out to be too complicated to hash out quickly for 8.0.x (too complicated and too risky too).

oliver-sanders avatar Sep 15 '22 11:09 oliver-sanders

See also #5579, we should probably wait for preparing tasks to clear before pausing the workflow.

oliver-sanders avatar Jun 13 '23 10:06 oliver-sanders

I'm taking a look into flushing out preparing tasks and pausing the workflow pre-reload as a means to fixing #5579. If successful I will punt the remainder of this issue (to do with the interaction of restart and reload) to another issue for later consideration.

oliver-sanders avatar Jun 19 '23 11:06 oliver-sanders

Bumped the cylc play --reload stuff to https://github.com/cylc/cylc-flow/issues/5591

oliver-sanders avatar Jun 19 '23 12:06 oliver-sanders