
workflow events: fix an issue where "timeout" events would not fire

Open oliver-sanders opened this issue 2 years ago • 3 comments

The result of attempting to document workflow events here: https://github.com/cylc/cylc-doc/pull/685

Wrote some integration tests and discovered that a couple of events weren't being fired in some situations.

[edit] It turns out that the way the scheduler terminates the subproc pool on shutdown means that, whilst this PR fixes the "event not being fired" problem, the event handler still won't actually run. This part of the issue has been bumped to #5997

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • [x] Tests are included (or explain why tests are not needed).
  • [x] CHANGES.md entry included if this is a change that can affect users
  • [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • [x] If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

oliver-sanders avatar Feb 07 '24 11:02 oliver-sanders

[scheduler]
    [[events]]
        abort on workflow timeout = True
        workflow timeout = PT1S
        abort handlers = echo "ALPHA"
        workflow timeout handlers = echo "TANGO"

The workflow timeout handler is not running when abort on workflow timeout is set

MetRonnie avatar Feb 21 '24 16:02 MetRonnie

So, it looks like it actually is firing off the event (and the test picked up on this).

BUT, on workflow abort, we terminate all processes in the subprocpool, which kills the handler command, likely before it has had a chance to run:

https://github.com/cylc/cylc-flow/blob/acb164f552a9f89565c662a5b5071b1ab83e44f1/cylc/flow/scheduler.py#L1925-L1937
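The effect can be demonstrated outside Cylc. This is a sketch of what happens to an in-flight handler command when the pool SIGKILLs its process group (the command and timings are stand-ins, not Cylc code):

```python
import os
import signal
import subprocess
import time

# Hypothetical stand-in for a workflow event handler command
# (mirrors the `echo "TANGO"` handler above, with a startup delay).
proc = subprocess.Popen(
    ['sh', '-c', 'sleep 1 && echo TANGO'],
    start_new_session=True,  # give it its own process group, as the pool does
)
time.sleep(0.1)

# What SubProcPool.terminate() effectively does via _killpg:
os.killpg(proc.pid, signal.SIGKILL)
proc.wait()
print(proc.returncode)  # -9 (killed by SIGKILL); TANGO is never printed
```

The handler is killed mid-flight, so from the user's point of view the event "fired" but the handler never ran.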

We could implement a mini main loop just to run the subprocpool until it empties naturally, but the pool could contain all sorts of things, such as job submission commands, which we probably don't want to run.
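A naive version of that mini main loop might look like this (sketch only: `is_not_done()`, `process()` and `terminate()` are real SubProcPool methods, but the stub pool and the safety timeout are assumptions for illustration):

```python
import time

class _StubPool:
    """Toy stand-in for SubProcPool that 'finishes' after three
    process() calls, so the drain loop can be exercised standalone."""
    def __init__(self):
        self.calls = 0
    def is_not_done(self):
        return self.calls < 3
    def process(self):
        self.calls += 1
    def terminate(self):
        self.calls = 3

def drain(pool, timeout=10.0):
    """Run the pool until it empties naturally, with a safety timeout
    so a stuck command cannot hang shutdown forever."""
    deadline = time.monotonic() + timeout
    while pool.is_not_done():
        if time.monotonic() > deadline:
            pool.terminate()  # give up and kill whatever is left
            break
        pool.process()  # reap finished commands, start queued ones

pool = _StubPool()
drain(pool)
print(pool.is_not_done())  # False: the queue drained naturally
```

The catch is exactly the one described above: this drains everything in the pool, job submissions included, not just the event handlers.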

Dammit. All I wanted to do was to clarify one line in the docs!

oliver-sanders avatar Feb 23 '24 15:02 oliver-sanders

We could implement a mini-main loop, just to run the subprocpool until it empties naturally

I've just tried this, but it gets messy.

patch
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 432ef11ff..687ebbb64 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1415,6 +1415,7 @@ class Scheduler:
 
         Run workflow events in simulation and dummy mode ONLY if enabled.
         """
+        LOG.warning(f'EVENT {event} - {reason}')
         conf = self.config
         with suppress(KeyError):
             if (
@@ -1925,7 +1926,8 @@ class Scheduler:
                 if self.proc_pool.is_not_done():
                     # e.g. KeyboardInterrupt
                     self.proc_pool.terminate()
-                self.proc_pool.process()
+                while self.proc_pool.is_not_done():
+                    self.proc_pool.process()
             except Exception as exc:
                 LOG.exception(exc)
 
diff --git a/cylc/flow/subprocpool.py b/cylc/flow/subprocpool.py
index c380bdaa3..a539aa8c5 100644
--- a/cylc/flow/subprocpool.py
+++ b/cylc/flow/subprocpool.py
@@ -332,16 +332,38 @@ class SubProcPool:
     def terminate(self):
         """Drain queue, and kill and process remaining child processes."""
         self.close()
+        keep = []
         # Drain queue
         while self.queuings:
-            ctx = self.queuings.popleft()[0]
-            ctx.err = self.ERR_WORKFLOW_STOPPING
-            ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
-            self._run_command_exit(ctx)
+            item = self.queuings.popleft()
+            ctx = item[0]
+            # if ctx.cmd_key.startswith(WORKFLOW_EVENT_HANDLER)
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if cmd_key.startswith('workflow-event-handler'):
+                keep.append(item)
+            else:
+                ctx.err = self.ERR_WORKFLOW_STOPPING
+                ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
+                self._run_command_exit(ctx)
+
+        # Re-add event handlers back into the queue
+        for item in keep:
+            self.queuings.append(item)
+
         # Kill remaining processes
         for value in self.runnings:
             proc = value[0]
-            if proc:
+            ctx = value[1]
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if not proc:
+                # no process to kill
+                continue
+            if not cmd_key.startswith('workflow-event-handler'):
+                # don't kill event handlers
                 _killpg(proc, SIGKILL)
         # Wait for child processes
         self.process()

Any job submission commands get killed; the fallout from this puts the task(s) in the submit-failed state, which in turn causes them to be marked as incomplete.

I'm not sure this has ever worked!

oliver-sanders avatar Feb 23 '24 16:02 oliver-sanders

Merging with three approvals.

oliver-sanders avatar Mar 25 '24 11:03 oliver-sanders