cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

parentless sequential xtrigger spawning

Open dwsutherland opened this issue 2 years ago • 27 comments

supersedes #5732

Spawning parentless xtrigger tasks out to the run ahead limit is unnecessary in the majority of cases, and creates a lot of UI clutter.

This PR makes sequential spawning on xtrigger satisfaction for parentless tasks optional, with non-sequential the default (no change). And the default for clock triggers.

Justification:

  • Naturally clock trigger tasks are sequential in time, so it makes sense to only spawn the next parentless clock-triggered task when the current clock trigger is satisfied.
  • Most workflows are sequential, and so workflows/downstream-workflows usually don't need to check xtriggers out to the runahead limit (i.e. workflow_state).
  • It makes little difference (split seconds) in "catch-up" mode, where all xtriggers will be satisfied on first check.
  • It puts less load on the Scheduler and UI, with ~1/runahead-limit times less xtriggers to check and tasks to show.
  • If the user wants non-sequential xtrigger checking, that xtrigger can be specified as such.

At NIWA we have a modular approach to running our workflows, both in operations (around ~50 workflows, most interconnected) and in research (commonly downstream of operational workflows). So wall_clock and workflow_state are by far the most common xtriggers. We do have some requirement for non-sequential workflow_state xtriggers (i.e. satellite checking, where data may or may not come in for a cycle), however, nearly all workflows and downstream workflows are sequential.. And many of these require very large runahead limits.

Take the following simple example;

[scheduling]
    initial cycle point = 20230920T0530Z
    runahead limit = P5
    [[xtriggers]]
        clock_1 = wall_clock()
        up_1 = workflow_state(workflow=clockspawn, task=b, point=%(point)s):PT7S
    [[graph]]
        PT1M = """
@clock_1 => a
a => b
@up_1 => c
"""

[runtime]
    [[root]]
        script = sleep 2
    [[a,b,c]]

Currently that produces: image

With this default behavior changed;

[scheduling]
    initial cycle point = 20230920T0530Z
    runahead limit = P5
    spawn from xtriggers sequentially = True
    [[xtriggers]]
        clock_1 = wall_clock()
        up_1 = workflow_state(workflow=clockspawn, task=b, point=%(point)s):PT7S
.
.
.

it is reduced to: image

Real workflows commonly have way more tasks (branches length and/or parallel) and way larger runahead limits, so the efficiency and clutter-less gains are massive.

And if for some reason users want some xtriggers to be checked/run out into the future/RH-limit in a non-sequential manner, then they can override the default with an argument (i.e. sequential=True/False):

    spawn from xtriggers sequentially = True
    [[xtriggers]]
        clock_1 = wall_clock()
        up_1 = workflow_state(workflow=clockspawn, task=b, point=%(point)s, sequential=False):PT7S
.
.
.

image (if there's multiple xtriggers on a single task it upholds the sequential setting if any)

This sequential argument name is reserved, and not passed onto the xtrigger module/function. It can be defined in the xtrigger module/function (as in workflow declaration), but must have a Boolean default: image (i.e. from modifying cylc.flow.xtriggers.workflow_state to include settings arg in function definition) cylc.flow.xtriggers.wall_clock.wall_clock has sequential=True set

On xtrigger satisfaction the corresponding parentless sequential xtriggered task (PSXT) will spawn out to the RH limit, or next occurrence of a PSXT, or non-parentless task occurrence.

Note: Tui sometimes shows no tasks before a corresponding xtrigger is satisfied. This problem is not reflected in the data-store or WUI.

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • [x] Tests are included (or explain why tests are not needed).
  • [ ] CHANGES.md entry included if this is a change that can affect users
  • [ ] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.

dwsutherland avatar Sep 20 '23 04:09 dwsutherland

Note, the Cylc 7 spawning behaviour can be achieved by adding submit dependencies between the parentless tasks e.g:

@up => a
a[-PT1M]:submitted => a

This forces the sequentiality of a without prohibiting multiple instances of a from running in parallel.

oliver-sanders avatar Sep 20 '23 15:09 oliver-sanders

Note, the Cylc 7 spawning behaviour can be achieved by adding submit dependencies between the parentless tasks e.g:

@up => a
a[-PT1M]:submitted => a

This forces the sequentiality of a without prohibiting multiple instances of a from running in parallel.

IMO that introduces clutter and false dependencies into the workflow definition.. I would have to do that roughly thousands of times across all 50 workflows.. (a number of these workflows query all other workflows' tasks via workflow_state) (by false dependency I mean; you desire xtriggers not to run all at once to RH and/or reduction of UI clutter (changing the spawning behavior), you don't want a to depend on previous a in any way)

Also some tasks might be in queues or tight job scheduler queues (and require submission retries).. (it would also be in the visible in the graph window, and I kinda like the SoD Cylc8 minimalist behavior)

I have a change inbound, like what we messaged about in Element, where this will be solved without effecting the current behavior.. Using one line of workflow definition ..

dwsutherland avatar Sep 21 '23 05:09 dwsutherland

Restart/reload and remove are now handled..

Remove will cause the next occurrence to spawn, the argument being:

  • the non-sequential case would be spawned, so be consistent.
  • otherwise the workflow would shutdown if it's the only task (and we have a stop button for that).

dwsutherland avatar Oct 06 '23 08:10 dwsutherland

I assume the docs are auto-generated from the config definition.. but I suppose I should build it with this branch to find out.

dwsutherland avatar Oct 19 '23 07:10 dwsutherland

Think I've made the suggested changes .. however not sure why fast tests are failing

dwsutherland avatar Nov 29 '23 09:11 dwsutherland

TUI integration test failures are unrelated to your changes - see #5849

However there are a few functional test failures here

MetRonnie avatar Nov 29 '23 09:11 MetRonnie

However there are a few functional test failures here

Looks like it was the ValueError vs KeyError thing. Functional tests passing now 👍

MetRonnie avatar Nov 29 '23 13:11 MetRonnie

However there are a few functional test failures here

Looks like it was the ValueError vs KeyError thing. Functional tests passing now 👍

Yeah, I had in my head that it was a list for some reason 🤦‍♂️

dwsutherland avatar Nov 30 '23 12:11 dwsutherland

Oh I've just spotted earlier up the conversation @oliver-sanders was in the process of writing integration tests - any progress?

MetRonnie avatar Feb 08 '24 15:02 MetRonnie

I got as far as the code I commented here: https://github.com/cylc/cylc-flow/pull/5738#pullrequestreview-1747695603

But it looks like there were some issues documented in TODOs.

There are a lot of potential interactions in the task pool [e.g. 1]. So for testing, it's less about checking that the sequential spawning is working and more to do with testing the interaction of sequential spawning with other functionalities to make sure we don't have any runtime bugs lurking.

[1] With delayed spawning there's a risk that the task pool could become prematurely empty (e.g. as a result of client commands) leading the workflow to shut down as complete.

oliver-sanders avatar Feb 08 '24 15:02 oliver-sanders

I'll add more tests. (note: any test with clocktriggers will now test this behavior implicitly (in w/e ways it does))

dwsutherland avatar Feb 08 '24 22:02 dwsutherland

I've found some holes in the spawning of parentless tasks with unsatisfied sequential xtriggers (via Oliver's integration test, and replicated manually).. should be easy fix

dwsutherland avatar Feb 09 '24 09:02 dwsutherland

Added the integration tests (thanks Oliver), and have gotten the existing scenarios/tests working after adding more spawning conditions and requirements. Will add more scenarios next.

dwsutherland avatar Feb 13 '24 09:02 dwsutherland

hmm... all the TUI tests are failing...

dwsutherland avatar Feb 13 '24 09:02 dwsutherland

hmm... all the TUI tests are failing...

Kicking off a test run on master to make sure this isn't the result of an external factor: https://github.com/cylc/cylc-flow/actions/runs/7887199508

oliver-sanders avatar Feb 13 '24 13:02 oliver-sanders

Fixed on master now. Kicking tests

MetRonnie avatar Feb 13 '24 15:02 MetRonnie

Even though both clock triggers are sequential by default, I get all of the tasks spawning out to the runahead limit ~1s after each other

I guess the expected behavior would be to only spawn when all sequential clocktriggers are satisfied? will add that behavior in

dwsutherland avatar Feb 20 '24 02:02 dwsutherland

Even though both clock triggers are sequential by default, I get all of the tasks spawning out to the runahead limit ~1s after each other

I guess the expected behavior would be to only spawn when all sequential clocktriggers are satisfied? will add that behavior in

Yeah, not sure about this .. Maybe xtriggers shouldn't hold each other up, and xtriggers could be doing who knows what.. (and BTW - ~1s spawn out to the RH limit is because clock_1 is satisfied straight away, so the next is spawned, as expected)

The point of this PR is to introduce a reasonable (and optional) restriction on the unbounded spawning and xtrigger-checking out the the RH limit.

If an xtrigger is satisfied, then the next occurrence (which may or may not be the next task occurrence) can be checked.. For example (this could be any xtrigger):

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2018
    [[xtriggers]]
        clock_1 = wall_clock()
        clock_2 = wall_clock(P10Y)
    [[graph]]
        P2D = @clock_1 & @clock_2 => foo
        ^+P1D/P2D = @clock_1 => foo

At present it works: image

If we restrict spawning until both are ready, then the next foos that can run won't.

Of course, having only clock_2 on the first:

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2018
    [[xtriggers]]
        clock_1 = wall_clock()
        clock_2 = wall_clock(P10Y)
    [[graph]]
        P2D = @clock_2 => foo
        ^+P1D/P2D = @clock_1 => foo

Will stop foo from spawning to a cycle that it would be able to run.. But this is in the hands of the workflow designer.

However, I can understand that there is an inconsistency here:

The presence of an unsatisfied sequential xtrigger will prevent the task spawning, and hence, future checking of any non-sequential xtriggers (regardless of their satisfaction). Yet the presence of an unsatisfied sequential xtrigger does not prevent spawning of a task with satisfied sequential xtriggers

So in light of that, I think I'm full circle to, maybe the spawning should be held up in the presence of any unsatisfied sequential xtrigger.. It might be a bit strong, but at least this behavior will be by design.

dwsutherland avatar Feb 20 '24 04:02 dwsutherland

Ok fixed it now:

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2018
    [[xtriggers]]
        clock_1 = wall_clock()
        clock_2 = wall_clock(P10Y)
    [[graph]]
        P1D = @clock_1 & @clock_2 => foo

image

(TUI tests failing .. Seems unrelated, passed locally)

dwsutherland avatar Feb 21 '24 10:02 dwsutherland

Tui has been broken by an urwid release, don't upgrade beyond 1.6.1 and you'll be fine. There may be a fix soon.

oliver-sanders avatar Feb 21 '24 12:02 oliver-sanders

I've sorted out the conflict and added a few tests in https://github.com/dwsutherland/cylc-flow/pull/19

MetRonnie avatar Feb 22 '24 13:02 MetRonnie

Note, will need a changelog entry (autogenerated these days, see the CONTRIBUTING page) and possibly a small entry in the "External Triggers" section of cylc-doc?

oliver-sanders avatar Feb 26 '24 13:02 oliver-sanders

Can build off https://github.com/cylc/cylc-doc/pull/686

MetRonnie avatar Feb 26 '24 14:02 MetRonnie

I've sorted out the conflict and added a few tests in dwsutherland#19

Saw there were a lot of conflicts with that PR, I assume you had rebased on master, so I rebased this branch on master. Now all conflicts in your branch are only in one file (the integration test).

dwsutherland avatar Feb 28 '24 22:02 dwsutherland

I hadn't rebased, I had merged master into this branch for dwsutherland#19 (before you rebased this branch on top of master). There was a merge mistake when you pushed to dwsutherland#19 after this rebase so I've now force-pushed a fix onto this branch. If that makes sense lol

MetRonnie avatar Feb 29 '24 11:02 MetRonnie

Ah I see you've force pushed again - the merge mistake is here: https://github.com/cylc/cylc-flow/compare/598e0afe6c2e89cbcc6353f48e2363fe2dde34f2..8e7e3f17fc8832aa6ebb9a9dec800ec85d624762#diff-616dbf97f9b89762d7da3666f7ab3abf0cc3f2e6443b5367af9d25439a4321d5R334-R354

The fix is this

diff --git a/cylc/flow/xtrigger_mgr.py b/cylc/flow/xtrigger_mgr.py
index ea4b546da..762d8aa5b 100644
--- a/cylc/flow/xtrigger_mgr.py
+++ b/cylc/flow/xtrigger_mgr.py
@@ -17,5 +17,5 @@
 from contextlib import suppress
 from enum import Enum
-from inspect import getfullargspec, signature
+from inspect import signature
 import json
 import re
@@ -332,25 +332,4 @@ class XtriggerManager:
                 label, f"'{fname}' not callable in xtrigger module '{fname}'",
             )
-        x_argspec = getfullargspec(func)
-        if 'sequential' in x_argspec.args:
-            if (
-                x_argspec.defaults is None
-                or not isinstance(
-                    x_argspec.defaults[x_argspec.args.index('sequential')],
-                    bool
-                )
-            ):
-                raise XtriggerConfigError(
-                    label,
-                    fname,
-                    (
-                        f"xtrigger module '{fname}' contains reserved argument"
-                        " name 'sequential' that has no boolean default"
-                    ),
-                )
-            elif 'sequential' not in fctx.func_kwargs:
-                fctx.func_kwargs['sequential'] = x_argspec.defaults[
-                    x_argspec.args.index('sequential')
-                ]
 
         sig = signature(func)
diff --git a/tests/integration/test_config.py b/tests/integration/test_config.py
index 72ec69eda..0a186aa41 100644
--- a/tests/integration/test_config.py
+++ b/tests/integration/test_config.py
@@ -358,5 +358,5 @@ def test_xtrig_validation_wall_clock(
     })
     with pytest.raises(WorkflowConfigError, match=(
-        r'\[@myxt\] wall_clock\(offset=PT7MH, sequential=True\) validation failed: '
+        r'\[@myxt\] wall_clock\(offset=PT7MH\) validation failed: '
         r'Invalid offset: PT7MH'
     )):

MetRonnie avatar Feb 29 '24 11:02 MetRonnie

Ah I see you've force pushed again - the merge error is here: https://github.com/cylc/cylc-flow/compare/598e0afe6c2e89cbcc6353f48e2363fe2dde34f2..8e7e3f17fc8832aa6ebb9a9dec800ec85d624762#diff-616dbf97f9b89762d7da3666f7ab3abf0cc3f2e6443b5367af9d25439a4321d5R334-R354

Haha.. Apologies, yes, I rebased and did some tests.. will squash as fix to your commit

dwsutherland avatar Feb 29 '24 11:02 dwsutherland

@dwsutherland this is now my main focus for review - are you working on conflict resolution, post cylc set merge? Ping me if you need any help understanding the changes from that PR.

hjoliver avatar Mar 14 '24 20:03 hjoliver

@dwsutherland this is now my main focus for review - are you working on conflict resolution, post cylc set merge? Ping me if you need any help understanding the changes from that PR.

Yes, working on this now.. thanks will do

dwsutherland avatar Mar 15 '24 03:03 dwsutherland

Testing ... looks good so far, nice change.

hjoliver avatar Mar 21 '24 01:03 hjoliver