Fix spawn order when group triggering tasks before the start cycle point

Open MetRonnie opened this issue 1 month ago • 16 comments

Following a warm start, group triggering tasks that exist in the initial cycle point only, such as so-called "install cold" tasks, has a bug where the prerequisites within the group are not obeyed - they end up force-satisfied and the whole group submits at the same time.

Repro

[scheduler]
    allow implicit tasks = True

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    runahead limit = P2
    [[graph]]
        R1 = herring => cold1 => cold2 => foo
        P1 = foo[-P1] => foo

[runtime]
    [[COLD]]
    [[cold1, cold2]]
        inherit = COLD

$ cylc play wflow --startcp 5
$ cylc trigger wflow//^/COLD
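
The expected behaviour for this repro can be sketched with a toy model. This is illustrative only: `split_prerequisites` and the edge list are hypothetical stand-ins, not cylc-flow internals, and it assumes that group trigger should force-satisfy only prerequisites that come from outside the group.

```python
# Illustrative only: a toy model of the repro graph, not cylc-flow internals.
# Edges of the R1 graph in cycle 1: (upstream, downstream).
EDGES = [("herring", "cold1"), ("cold1", "cold2"), ("cold2", "foo")]


def split_prerequisites(group, edges):
    """For each task in the group, split its upstreams into in-group
    prerequisites (to be obeyed) and off-group ones (to be force-satisfied).

    Hypothetical helper: the name and return shape are assumptions.
    """
    plan = {}
    for task in group:
        upstreams = [up for up, down in edges if down == task]
        plan[task] = {
            "obey": [up for up in upstreams if up in group],
            "force_satisfy": [up for up in upstreams if up not in group],
        }
    return plan


# The COLD family from the repro expands to cold1 and cold2.
plan = split_prerequisites({"cold1", "cold2"}, EDGES)
print(plan["cold1"])  # herring is off-group, so it is force-satisfied
print(plan["cold2"])  # cold1 is in-group, so cold2 should wait for it
```

Under that assumption cold2 should wait for cold1, whereas the bug described above force-satisfies both prerequisites so the two tasks submit together.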

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] No dependency changes
  • [x] Tests are included
  • [x] Changelog entry included if this is a change that can affect users
  • [x] No docs needed
  • [x] If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

MetRonnie avatar Nov 26 '25 13:11 MetRonnie

flake8-bugbear failure is due to a new rule. Fixed by https://github.com/cylc/cylc-flow/pull/7111

MetRonnie avatar Dec 01 '25 15:12 MetRonnie

Kicking tests

MetRonnie avatar Dec 01 '25 17:12 MetRonnie

(coverage failing on uncovered repr methods)

oliver-sanders avatar Dec 02 '25 13:12 oliver-sanders

(coverage failing on uncovered repr methods)

Added doctests

MetRonnie avatar Dec 05 '25 12:12 MetRonnie

The workflow flowing on from the triggered tasks is expected (but problematic) at the moment; I have not found a fix for that yet. However, the runahead stall in your example is probably preferable to it not stalling, as it gives a chance to remove the unwanted tasks!

MetRonnie avatar Dec 05 '25 14:12 MetRonnie

Understood, but unfortunately, I think that we need to come up with a solution to the flow-on problem before the intervention for this use case can be used in anger as it's too caveat-prone without this.

No idea what that solution would be however!

oliver-sanders avatar Dec 05 '25 15:12 oliver-sanders

Does this not create the behaviour we require?

$ cylc trigger '^/cold'  --flow=none  # trigger a R1 task

(however we lose the group trigger interdependence if we trigger a set of tasks like this)

wxtim avatar Dec 08 '25 13:12 wxtim

(however we lose the group trigger interdependence if we trigger a set of tasks like this)

That is the problem. Fortunately only a handful of operational workflows have cold start tasks with prereqs between them at the Met Office

MetRonnie avatar Dec 08 '25 13:12 MetRonnie

(however we lose the group trigger interdependence if we trigger a set of tasks like this)

That is the problem. Fortunately only a handful of operational workflows have cold start tasks with prereqs between them at the Met Office

Unfortunately these are near-ubiquitous here at ESNZ, right @dwsutherland ? (Probably because for years now we've had tasks that deploy code into the run-dir from git repos). Which is one of the main reasons we've wanted this forever:

  • https://github.com/cylc/cylc-flow/issues/7020
  • (in fact we commented on this flow-on problem there: "it makes retriggering the startup graph difficult or confusing (will it "flow on" again?)" - the comment probably dates back to when we had to use a new flow for retriggering, but moving the workflow start point forward brings the problem back again)

Otherwise, we've had several discussions in the past about how to stop flows from flowing on:

  • https://github.com/cylc/cylc-flow/issues/3750
  • https://github.com/cylc/cylc-flow/issues/4741#issuecomment-1071905728

Possibilities include:

  • starting a flow with a defined end cycle-point
  • a second command to tell an existing flow when or where to stop

(These require more understanding of flows from the user).

Maybe a short-cut variation on the flow-end-point idea?

  • cylc trigger --single-cycle-point=^

hjoliver avatar Dec 09 '25 02:12 hjoliver

@hjoliver, the issue we're discussing here is specific to warm starts only, but isn't strictly specific to R1 tasks (though the use case we're focusing on is).

Generalisation of the problem: The triggering of tasks before the start cycle point in a warm started workflow [1].

Problems with this ATM:

  1. In-group dependencies are ignored (fixed by this PR).
  2. Cylc flows on from the triggered tasks (the remainder of this discussion).

Re-triggering R1 tasks is normally no issue: Cylc does not flow on because it looks in the DB and discovers that the downstream tasks have already run. This mechanism works just fine for cold starts...

However, with warm starts [1], we delete the workflow database and restart from a specified "start cycle point". As part of the workflow startup logic, Cylc assumes that everything before the start cycle point has succeeded [2]; however, this assumption is confined to the startup logic. When those R1 tasks run, their outputs cause downstreams to run because the downstreams do not exist in the database (because we deleted the database, this is a warm start!). Note that with a warm start, there is only one flow (note also that we do not use new flows at the MO). It's not the starting and stopping of flows that's an issue, it's the lack of workflow history (because it's a warm start!).

This is really an internal consistency issue: one part of Cylc (startup logic) is saying that everything before the start cycle point has succeeded, whereas another part of Cylc (pre-spawn check) is saying that it didn't. As a result, the behaviour is unhelpful and defies user expectations.

However, I think we can resolve this issue by patching the pre-spawn check to match/reflect the warm start logic. If a task is before the start cycle point, we would simply assume it to have succeeded in the absence of a DB entry to the contrary. This would make the startup and task pool logic consistent, and the resulting behaviours would match the cold-start scenario.

Under this approach:

  • If you trigger a task before the start cycle point, it will run, but not spawn downstreams.
  • If you trigger a group of tasks before the start cycle point, they will run in order (with this PR), but not spawn downstreams.
  • If you remove, set or trigger a task, then a DB entry would be created, overriding the default "assume this task succeeded" logic.
  • No new commands, options or semantics are required; warm-start behaviours match cold-start ones for default options.
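
A minimal sketch of the proposed check, assuming integer cycling as in the repro. The function name `already_done`, its parameters, and the dict standing in for the workflow DB are all hypothetical; this is not the cylc-flow API, only the decision rule described above.

```python
# Illustrative only: hypothetical names, not the cylc-flow API.
from typing import Dict, Optional, Tuple

# Toy stand-in for the workflow DB: (cycle point, task name) -> completed?
TaskKey = Tuple[int, str]


def already_done(
    db: Dict[TaskKey, bool],
    point: int,
    name: str,
    start_point: int,
    assume_pre_start_succeeded: bool,
) -> bool:
    """Return True if the task should NOT be spawned (treated as complete)."""
    completed: Optional[bool] = db.get((point, name))
    if completed is not None:
        # An explicit DB entry always wins over the default assumption.
        return completed
    # No history: the proposal treats tasks before the start point as done.
    return assume_pre_start_succeeded and point < start_point


empty_db: Dict[TaskKey, bool] = {}

# Warm start at 5 with no history: today 1/foo is spawned (flow-on) ...
current = already_done(empty_db, 1, "foo", 5, assume_pre_start_succeeded=False)
# ... under the proposal it is assumed complete, so the flow stops.
proposed = already_done(empty_db, 1, "foo", 5, assume_pre_start_succeeded=True)
# An explicit entry "to the contrary" overrides the assumption.
overridden = already_done({(1, "foo"): False}, 1, "foo", 5, True)
print(current, proposed, overridden)
```

Tasks at or after the start cycle point are unaffected in this sketch, since the default only applies when `point < start_point`.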

WDYT

Notes:

[1] Warm start meaning: shut down the workflow, delete the DB, start the workflow from a specified start cycle point. This is not an everyday intervention. Warm starts are mostly just a useful backstop for emergency situations.

[2] By "succeeded" I mean "final completed".

oliver-sanders avatar Dec 09 '25 14:12 oliver-sanders

[UPDATE: read my follow-up before responding to these comments - they're not wrong, but I like the proposed alternative!]

@oliver-sanders - I understand what a warm-start is - I think we just have a different take on the proper generalization.

Note that with a warm start, there is only one flow ..., it's not the starting and stopping of flows that's an issue, it's the lack of workflow history (because it's a warm start!).

Well, given the lack of history to stop the flow (due to deliberate deletion of it!) stopping the flow is the problem! So, this is (or at least can be viewed as) a particular example of the more general need for control over flow termination. (Note I mean "the flow" in a generic sense - it doesn't have to be a new flow).

I think you're really saying that users expect there to be history to stop flow 1 from continuing, even though they deliberately deleted the history! (Or that they shouldn't even have to understand that deleting the DB does that - meh, that's a pretty violent action, I don't think it's unreasonable to have to think about the consequences!).

Also, note that users would not necessarily have to use the low level flow control capability directly (c.f. group trigger vs lower level manual remove, set, and trigger). E.g. for this sort of use case we could provide something like this:

 cylc trigger --no-flow-on <task-ids>

to mean run <task-ids> with a new flow configured to stop after that group.


This is really an internal consistency issue, one part of Cylc (startup logic) is saying that everything before the start cycle point has succeeded, whereas another part of Cylc (pre-spawn check) is saying that they didn't.

I do see this perspective as well, although arguably the real problem is that the startup logic is an imperfect bodge that affects more of the graph than it should. It was a convenience, to bootstrap initial inter-cycle triggers so that users don't have to define the real start-up dependencies explicitly, but in addition it wipes out all dependencies back to the start of the graph.

hjoliver avatar Dec 09 '25 20:12 hjoliver

However, I think we can resolve this issue by patching the pre-spawn check to match/reflect the warm start logic. If a task is before the start cycle point, we would simply assume it to have succeeded in the absence of a DB entry to the contrary. This would make the startup and task pool logic consistent, the resulting behaviours would match the cold-start scenario: WDYT

Nice.

I stand by my comments above - if we had general flow termination capability we could use that to solve this specific problem.

However, I like your suggestion! It makes sense, it's easy to implement, and it solves this specific problem quickly. I guess sometimes a less general solution wins out...

(BTW this is also a means of flow termination, just not a general one in that it is specific to pre-start points: we will assume flow history prior to the start point even if it does not exist in the DB, which will terminate the flow downstream of triggered tasks.)


So, should @MetRonnie do that on this PR, or shall we have a follow-up PR (which had better be released at the same time)?

hjoliver avatar Dec 09 '25 21:12 hjoliver

I think I've implemented @oliver-sanders' suggestion in https://github.com/MetRonnie/cylc-flow/compare/group-trigger-warm...wxtim:cylc:group-trigger-warm, (not suitable for a PR, contains manual test), but the test case workflow only runs cold1. Can I invite @oliver-sanders to check my change and @MetRonnie to check the combination?

wxtim avatar Dec 10 '25 11:12 wxtim

@wxtim, yep, that's the right start and it works for triggering a single task, correctly suppressing flow-on.

However, it doesn't work for triggering a group of tasks, as all downstreams are considered complete. This is where the flow=None bit comes in. We additionally need to make it look like these tasks exist to cylc remove, so that the in-group tasks get removed, resulting in flow=None entries in the DB.
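
The group case can be illustrated with a toy model of the repro chain. It is a sketch under assumptions: `run_group` and `explicit_entries` are hypothetical, and the set of explicit entries merely stands in for the flow=None DB entries that removal would leave behind.

```python
# Illustrative only: a toy model, not cylc-flow internals.
CHAIN = ["cold1", "cold2", "foo"]  # cold1 => cold2 => foo, all in cycle 1


def run_group(first, explicit_entries):
    """Follow the chain from `first`, spawning each downstream unless the
    pre-spawn check treats it as already complete.

    explicit_entries: tasks with a DB entry saying they have NOT run
    (standing in for the flow=None entries left by removal).
    """
    ran = [first]
    for task in CHAIN[CHAIN.index(first) + 1:]:
        # Pre-start default: no entry means "assume it succeeded".
        assumed_complete = task not in explicit_entries
        if assumed_complete:
            break  # not spawned, so the flow stops here
        ran.append(task)
    return ran


only_default = run_group("cold1", explicit_entries=set())
with_entries = run_group("cold1", explicit_entries={"cold2"})
print(only_default)  # the group stalls after cold1
print(with_entries)  # cold2 runs in order, and foo is still not spawned
```

With only the default assumption the chain stops after cold1, which matches the observation that the test workflow only ran cold1. Giving the in-group tasks explicit entries lets cold2 run in order while still stopping before foo.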

oliver-sanders avatar Dec 10 '25 14:12 oliver-sanders

Leaving this with @MetRonnie

wxtim avatar Dec 11 '25 13:12 wxtim

(I have finished working on this PR, it's ready for review. The bug fix for the flow-on problem is on a separate branch)

MetRonnie avatar Dec 12 '25 16:12 MetRonnie