cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Optional outputs can cause legit partially-satisfied prereqiuisites

Open hjoliver opened this issue 2 years ago • 3 comments

We need to analyze partially satisfied prerequisites before using them as an excuse to stall. If some correspond to optional outputs, then partial satisfaction may not indicate an error condition.

Simplified version of user workflow:

[scheduling]
   [[graph]]
      R1 = """
         FAM:fail-all? => bad
         FAM:succeed-any? => good
     """
[runtime]
   [[FAM]]
   [[a]]
      inherit = FAM
      script = true
   [[b]]
      inherit = FAM
      script = false  # OK, graph says success is optional for me 
   [[bad]]
   [[good]]

Result:

INFO - [1/good running job:01 flows:1] => succeeded
WARNING - Partially satisfied prerequisites:
      * 1/bad is waiting on ['1/a:failed']
CRITICAL - Workflow stalled

The problem is: bad depends on multiple optional outputs, so we should not be surprised that it ends up partially satisfied.

Currently the only automatic solution is ... our old friend the suicide trigger 😠

"FAM:succeed-any? & FAM:finish-all? => !bad"

(We have to wait for all members of FAM to finish, because bad could be spawned after good triggers).

hjoliver avatar Sep 14 '23 00:09 hjoliver

To refine this a bit, it's not specifically to do with family triggers.

"a? & b? => c  # with a fails and b succeeds"

or even:

"foo:x? & foo:y? => bar  # with :x generated but not :y"

This will stall with c (and bar) partially satisfied.

The graph says, trigger c only if both a and b succeed, and they are expected to fail sometimes - so that's not a reason to assume an error has occurred.

hjoliver avatar Sep 14 '23 01:09 hjoliver

Assigned to 8.x milestone, but we should bring it forward it if possible - it's a bit nasty for users to understand and deal with.

hjoliver avatar Sep 14 '23 01:09 hjoliver

Sticking the question label on this to flag for our next meeting as this will probably require some hashing out of details.

oliver-sanders avatar Apr 11 '24 13:04 oliver-sanders

Documented this limitation and the suicide trigger workaround in https://github.com/cylc/cylc-doc/pull/772

oliver-sanders avatar Nov 07 '24 15:11 oliver-sanders

Documenting this is good for now, but I still think, generally, we should not stall if the reason for the stall is ONLY unsatisfied optional outputs. It shouldn't need suicide triggers. I think from memory of the initial discussion about this @dpmatthews did not like this suggestion? If so, what's the reason?

hjoliver avatar Nov 07 '24 22:11 hjoliver

I don't think there is any resistance to solving this issue if it is possible to do so, but there is concern that it might not be possible to resolve this issue safely without undermining other use cases.

This issue isn't going to be resolved for a while so we need to at least document it for now and see what can be done for it when the time allows.

oliver-sanders avatar Nov 08 '24 10:11 oliver-sanders

In my view, a? => c implies "if a succeeds then c must be run (when all it's pre-requisites are met)". If b? => c gets added to the graph (possibly by someone else in a completely different part of the graph) that shouldn't mean c might not run - that wouldn't be safe.

This doesn't just apply to optional outputs - the same can happen with optional branches.

If this is seen as a major problem then I think we'd need to introduce a new syntax, e.g. a? =? c implies "if a succeeds then run c if all it's pre-requisites are met"

dpmatthews avatar Nov 11 '24 09:11 dpmatthews

If this is seen as a major problem then I think we'd need to introduce a new syntax

Modelling "optionality" on the edge (i.e. trigger) was my initial suggestion (https://github.com/cylc/cylc-admin/pull/128) which was rejected in favour of modelling on the node (i.e. output) (https://github.com/cylc/cylc-admin/pull/129).

I think there's a lot to be said for modelling on the edge (e.g. access to the full power of conditional statements) and that it's probably a more "natural" paradigm for graph designers (users frequently get mixed up with optional outputs in a way that suggests they are thinking in triggers not outputs, e.g. why do I need a question mark on the RHS? Or, do I need a question mark on an optional task?).

However, migrating at this point is simply not feasible and supporting both concepts in parallel would make a confusing situation much worse (not to mention the challenging interaction space between the two). Easier just to document the suicide trigger approach for this relatively niche situation.

oliver-sanders avatar Nov 11 '24 09:11 oliver-sanders

Optional triggers would be nice in some ways, but more confusing in other ways, e.g: https://github.com/cylc/cylc-admin/pull/128#issuecomment-886303145

In my view, a? => c implies "if a succeeds then c must be run (when all it's pre-requisites are met)". If b? => c gets added to the graph (possibly by someone else in a completely different part of the graph) that shouldn't mean c might not run - that wouldn't be safe.

I can't say I agree with this @dpmatthews. a? => c means "c depends on the (optional) success of a" end of story. Putting your spin on top of that is an optional extra that I don't recall being agreed.

We always have to take all of a task's triggers into account, and they're often spread all over the graph configuration. By definition, if an output is optional it doesn't have to be generated, so logically a task that depends on it doesn't have to run. That also goes for a task that depends on multiple optional outputs. So there is no reason to think something is wrong if a task ends up with unsatisfied dependence on optional outputs, and accordingly we should just remove it and log the reason.

An unnecessary stall is a bad outcome too. It can, for instance, completely halt the progress of an important experiment over the weekend, or worse.

hjoliver avatar Nov 11 '24 21:11 hjoliver

I can't say I agree with this @dpmatthews. a? => c means "c depends on the (optional) success of a" end of story. Putting your spin on top of that is an optional extra that I don't recall being agreed.

My description simply describes the way Cylc has always worked - partially satisfied pre-requisites have always results in a stall.

By definition, if an output is optional it doesn't have to be generated, so logically a task that depends on it doesn't have to run.

If an output is required it doesn't have to be generated either (the task might not run). I don't think the fact that an output is marked as optional makes any difference.

dpmatthews avatar Nov 12 '24 08:11 dpmatthews

My description simply describes the way Cylc has always worked - partially satisfied pre-requisites have always results in a stall.

I'm talking about how it should work, not how it's always worked. I guess we don't need to do any more development then 😁

If an output is required it doesn't have to be generated either (the task might not run).

All of this is obviously predicated on "if the task runs".

If a task runs it is has to generate its required outputs (that's what required means!), and if it doesn't do so we punish it for not fulfilling its requirements (by retaining it as final incomplete until removed or completed).

If a task runs, it is not required to generate its optional outputs, and it doesn't make any sense to cause a stall entirely on the basis of non-completion of optional outputs. By definition they do not have to be completed, so there is no reason to assume something is wrong.

hjoliver avatar Nov 12 '24 09:11 hjoliver

@dpmatthews - as per the wording of my original examples above (and what I just said in my previous post) I am assuming that the parent tasks - i.e. the owners of the optional outputs - actually ran, and I think you're worried about cases were the parents tasks didn't run at all due to events further upstream.

Let's make the distinction explicit then, and it should satisfy both of us:

If the workflow stalls with a partially satisfied task present, examine its unsatisfied prerequisites:

  • IF they are all optional outputs AND all the parent tasks all ran to completion [update: by this I mean final status and fulfilled its completion condition], remove the task (and unstall the workflow)
  • otherwise (not optional outputs, or the parent tasks did not run at all) do not remove the task (and remain stalled)

Do you agree?

hjoliver avatar Nov 12 '24 09:11 hjoliver

Do you agree?

Granted, I haven't given much thought to this yet, but my answer is no.

The problem is that we do not have enough information to tell whether a partially-satisfied task "should" (by the user's intention) be run or not.

E.g, consider these two cases:

# user's intention: "b" is a required task, the workflow should stall if its prereqs are not met
a:succeeded? | a:failed? => b => c
# user's intention: "b" is an optional task, it should only be run if its prereqs are met
a:x? | a:y? => b

(note that submit-failed and expired are also "final" states, so the first example is not safe under the suggested logic above)

Both cases are expressed exactly the same, but the correct handling is completely different.

This is the result of modelling optionality on the node (i.e. the output). As Dave noted above, we could handle this safely by modelling optionality on the edge (i.e. the trigger):

# user's intention: "b" is a required task, it cannot be skipped
a:succeeded | a:failed => b => c
# user's intention: "c" is an optional task, it should only be run in a particular case
a:x | b:x =? c

Because this gives us a way of coupling optionality to consequences.

As a result, I think this is a fundamental limitation of optional outputs that is probably not resolvable? We could consider adding optional triggers, but I think supporting both models in parallel would be extremely confusing for users.

oliver-sanders avatar Nov 12 '24 11:11 oliver-sanders

I find that response quite confusing, for several reasons:

user's intention: "b" is a required task

What exactly do you mean by that? We don't actually support the concept of a "required task". Tasks can have required or optional outputs (which makes a lot of sense in terms of how real tasks actually work) and other tasks can trigger off of those outputs if they are generated. If a task runs to a final status without generating its required outputs, we retain it as final incomplete to cause a stall. End of story.

Further, your examples only have OR triggers and as such don't generate partially satisfied tasks at all, which is the whole point of this Issue. Am I supposed to think there are other triggers elsewhere in the graph that you haven't shown?

(note that submit-failed and expired are also "final" states, so the first example is not safe under the suggested logic above)

By "ran to completion" above I meant fulfilled its completion condition, so any missed optional outputs do not indicate a problem. Anything else will result in a stall due to the parent task itself being final incomplete, in which case we don't need to rely on partially satisfied children for the stall.

hjoliver avatar Nov 12 '24 20:11 hjoliver

Maybe I've misunderstood your responses, but I think you might have over-interpreted my proposal and therefore launched unnecessarily into optional triggers etc. I think what I'm trying to highlight is actually a fairly simple and solvable problem under the current optional outputs model. So let's return to the basics to try to drill down on that.

Presumably your objection must boil down to one of these?:

  1. it is never safe to remove a partially satisfied task because we can't be sure that it won't get satisfied in the future, or
  2. it is never safe to remove a partially satisfied task because it always indicates an error, or at least can't easily be distinguished from error cases, that must be highlighted

Is that right? (Note I'm giving 2. for completeness, even though "always an error" it is transparently wrong). If not, what exactly is your objection?

Anyhow, I think we can definitively identify cases where we can be sure that removing a partially satisfied task to prevent an unnecessary stall is the right thing to do. Can you please respond specifically to the following argument. If you do not agree, what specifically do you think is wrong with my logic below?

a:x? & a:y? => b 

# Result:
# 1. a ran and succeeded, generating output :x but not :y
#   (so b is partially satisfied, stuck in the pool waiting on a:y?)
# 2. b has no other triggers

If 1.-2. are true, then actually we wouldn't even have to wait for a stall to remove b.

  • the graph says "run b ONLY if a:x? AND a:y? are both generated"
    • (and being optional, they might not both be generated - and that's perfectly OK)
  • a ran successfully and fulfilled its completion condition in the workflow
  • so nothing went wrong, but b's trigger condition was not satisfied
    • (as should be expected sometimes - it depends on optional outputs)
  • therefore b should not be triggered

In this particular case at least, everything went as planned and the result, as expected sometimes, is don't trigger b. There is nothing in this to suggest that stalling the workflow is necessary or desirable, right? If fact, not to remove b is to cause an entirely unnecessary stall, which is a bad outcome for users.

[Aside: all else being equal, why should this stall when the single trigger case doesn't? a:y? => b]

[BTW this is really just an example of backing out of SoD's minimal pre-spawning implementation, which is done purely to track prerequisites, as described quite nicely by you here: https://github.com/cylc/cylc-flow/pull/6472#discussion_r1837879110 We pre-spawn b in order to track its prerequisites, so we can run it if they all get satisfied, but if they don't, and we know that (a) they won't and (b) nothing is wrong, we should back out of the spawning.]

hjoliver avatar Nov 13 '24 03:11 hjoliver

(Note we can avoid considering unsatisfied prerequisites where the parent tasks have not run to completion - if the workflow has not stalled yet, they may yet get satisfied; if the workflow has stalled due to presence of final incomplete tasks - that is an error condition and fixing it - by completing those tasks - may result in satisfying the prerequisites once the flow gets going again).

hjoliver avatar Nov 13 '24 03:11 hjoliver

Sorry, my example was bunk. If I can wind back a couple of posts...

Here's a nasty example that highlights the problem:

[scheduling]
    [[graph]]
        R1 = """
            a:x? & b:x? => x
            a:y? & b:y? => y

            x | y => c
        """

[runtime]
    [[a, b]]
        completion = succeeded and (x or y)
        [[[outputs]]]
            x = xxx
            y = yyy
    [[x, y]]
    [[c]]

We have two alternate pathways through the graph. The user wants Cylc to follow one of them.

User's intention:

a:x a:y
b:x run "x" stall
b:y stall run "y"

Cylc's current behaviour matches the user's intention.


a:x? & a:y? => b 

# Result:
# 1. a ran and succeeded, generating output :x but not :y
#   (so b is partially satisfied, stuck in the pool waiting on a:y?)
# 2. b has no other triggers

Note, this example matches your a:x? & a:y? => b example above, however (what I was attempting to explain in my initial post), the user's intention is completely different. In this case you want:

a:x a:y
b:x run "x" do-nothing
b:y do-nothing run "y"

Under current logic, this intention can be expressed like so (the newly documented workaround):

[scheduling]
    [[graph]]
        R1 = """
            a:x? & b:x? => x
            a:y? & b:y? => y

            a:x? | b:x? => !y
            a:y? | b:y? => !x

            x | y => c
        """

So, there are two possible intentions behind a:x? & a:y? => b.

You have assumed that having one, but not both outputs complete means the downstream task is not needed in any case (what I meant by "optional task"), however, this scenario can also indicate a genuine graph evolution error (what I meant by "required task").

We cannot tell the user's intention without more information then is provided by optional outputs alone. As a result, I don't think that this problem can be solved under this model, hence Dave's suggestion of using optional triggers to provide this required context.


Presumably your objection must boil down to one of these?:

  • it is never safe to remove a partially satisfied task because we can't be sure that it won't get satisfied in the future, or
  • it is never safe to remove a partially satisfied task because it always indicates an error, or at least can't easily be > distinguished from error cases, that must be highlighted

Again, there's no "objection" to solving this, I'm just not convinced that it is solvable.

I don't agree in full with either of the bullets above, (it is sometimes safe to remove a partially satisfied task, but, other times it is not). I would say:

  • It is not safe to remove a partially satisfied task because we cannot tell if it indicates an error or not.

oliver-sanders avatar Nov 13 '24 10:11 oliver-sanders

OK thanks for that, it makes a lot more sense now - the "bunk" example was really screwing with my head!

On reflection, it seems our respective perspectives boil down to this:


There are possible combinations of outputs that the graph does not explicitly handle. If such a combination eventuates at runtime and leaves a task with partially satisfied prerequisites:

  • NZ: the graph does not say to do anything in this situation, so don't do anything
    • (just back out of the pre-spawning, for prerequisite tracking, which created a task that turned out not to be needed)
  • UK: the graph does not say what to do in this situation, so stall the workflow
    • (the user might have actually intended a stall in this situation)

As stated there, neither of these approaches is illogical in principle, so maybe they are just a matter of preference, in which case unfortunately for me it seems I'm gonna be outvoted, but I still don't think your approach is the right one, for reasons:

This does not strike me as "graph evolution error" at all. I'm not assuming anything about the user's intention, just taking their graph at its word. The combination of events needed to trigger the partially-satisfied task did not occur, so don't trigger it, and there's no reason to think anything is wrong because the associated outputs were declared to be optional.

We have two alternate pathways through the graph. The user wants Cylc to follow one of them.

The user has given two pathways for two specific (combinations of) events, but according to their own workflow configuration those are not the only possibilities so how on earth can they reasonably want Cylc to follow one of the two paths every time. If you have binary alternate graph paths and you want one or the other to run every time, they had better be triggered by binary events.

User's intention: ... stall

Since when does a user "intend" their workflow to stall? In your table of outputs and intentions, surely "stall" should be replaced by what the user actually wants to happen if that combination of events occurs, if indeed it is not "do-nothing"? Can you give a concrete example where a stall is the true intention, and what the intervention would be to recover from that?

I think you're really making the claim that every combination of events must be handled explicitly, and if anything else occurs we should assume that the user didn't anticipate it and might want a different graph if they had anticipated it (versus they simply did not add graph paths for events that they don't care about).

If so, fine in principle, I just strongly prefer the simpler, less draconian approach. But in practice it's at least inconsistent to use partially satisfied prerequisites as a proxy for this state of affairs, which can easily occur without them:

a:x? => b
  # or
a => b

# a generates only optional :y, and succeeds.
# there are no graph paths defined for a:y
# this does not stall

hjoliver avatar Nov 14 '24 01:11 hjoliver

I'm probably with Hilary, and as he hinted, this wouldn't be an issue if we didn't spawn partially satisfied tasks.

If we only spawned parentless and prereq satisfied tasks the examples would run as they read, and the stall intentions would seem a little silly in this context.. But perhaps users expect something different with the practical concession of spawning partially satisfied tasks.

dwsutherland avatar Jan 28 '25 05:01 dwsutherland

https://cylc.discourse.group/t/graph-branching-vs-suicide-trigger/1111

hjoliver avatar Jan 29 '25 05:01 hjoliver

If we only spawned parentless and prereq satisfied tasks

Obviously, this would require a different strategy for knowing when a task can run.. Something like, given a task output lookup/generate what tasks have it as a prerequisite and check all other prerequisites of these tasks (from a DB of sorts), if satisfied run.

The partially satisfied tasks would turn up in the n=1 data-store window, and wouldn't be in the active pool to need removal if a path isn't taken...

Optional output stalling doesn't fit anywhere in this idealized spawning scheme.

(if xtriggers were represented as running tasks, then we wouldn't have any waiting tasks at n=0)

dwsutherland avatar Jan 29 '25 05:01 dwsutherland

Obviously, this would require a different strategy for knowing when a task can run..

I'm in danger of confusing the issue by going off topic here, but I don't think you mean "knowing when a task can run".

I think of spawning on the first output as merely an easy way (given Cylc internals) of tracking - and exposing to users - partially satisfied prerequisites. It doesn't make any difference to when tasks can run.

(if xtriggers were represented as running tasks, then we wouldn't have any waiting tasks at n=0)

Even better, an xtrigger (internal to the scheduler, no need to represent as a running task) should spawn dependent tasks on demand (i.e., at the moment the external condition becomes satisfied), rather than belonging to pre-spawned waiting tasks. Same result though, in terms of less waiting tasks in n=0.

hjoliver avatar Jan 29 '25 06:01 hjoliver

I'm in danger of confusing the issue by going off topic here

I was merely reducing our spawning strategy down to it's idealized limit to illustrate that stalling doesn't make sense for these optional output examples.

Even better, an xtrigger (internal to the scheduler, no need to represent as a running task) should spawn dependent tasks on demand

Yeah, only problem is you can't expand a window (to inform the user) about no tasks.

dwsutherland avatar Jan 29 '25 12:01 dwsutherland

I was merely reducing our spawning strategy down to it's idealized limit to illustrate that stalling doesn't make sense for these optional output examples.

Yes, I realized that, and I entirely agree! - but it doesn't change "knowing when a task can run" (maybe you didn't mean to use "run" there?)

Yeah, only problem is you can't expand a window (to inform the user) about no tasks.

Yes, we'd need a different strategy for exposing the right info to users. The current way, despite being a sort of kludge that leveraged existing Cylc internals, does have some advantages.

hjoliver avatar Jan 29 '25 21:01 hjoliver