cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Feature: Sim mode at runtime

Open wxtim opened this issue 2 years ago • 11 comments

Recheck simulation mode settings at runtime if a broadcast might have changed them. If it has, update these settings.

A waypoint on the road to skip mode.

~Based upon https://github.com/cylc/cylc-flow/pull/5712. Drafted until that is in.~

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • [x] Tests are included (or explain why tests are not needed).
  • [x] CHANGES.md entry included if this is a change that can affect users
  • [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • [x] If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

wxtim avatar Sep 06 '23 15:09 wxtim

Kicking tests

MetRonnie avatar Dec 04 '23 17:12 MetRonnie

tests/integration/tui/test_updater.py::test_filters - TypeError: 'NoneType' object is not subscriptable

Looks genuine?

MetRonnie avatar Dec 04 '23 17:12 MetRonnie

tests/integration/tui/test_updater.py::test_filters - TypeError: 'NoneType' object is not subscriptable

Looks genuine?

Yes, it was both genuine and potentially nasty. Hopefully ok now, although it makes me nervous about how well tested this code is: 100% coverage of code ≠ 100% coverage of things users will do

wxtim avatar Dec 05 '23 11:12 wxtim

@oliver-sanders

wxtim avatar Jan 15 '24 15:01 wxtim

However foo fails again

fail try 1 only != fail submit 1 only: It only plays with execution retry delays, not manual resubmits.

I'm not sure if that's OK, but that's the current behaviour. I'm considering raising a "Question" issue.

wxtim avatar Feb 13 '24 10:02 wxtim

That's OK, but the docs could do with that detail

image

MetRonnie avatar Feb 13 '24 10:02 MetRonnie

That's OK, but the docs could do with that detail

image

Agreed - see the commit immediately above this comment. 😄 (https://github.com/cylc/cylc-flow/pull/5721/commits/e9a087e77eca37b46573cbb6d4cdf7f25fd8ced0)

wxtim avatar Feb 13 '24 11:02 wxtim

Are the mode_settings are being created when the task is created (which would be immediately after 1/a submits in this case) rather than when the task is submitted (which is after the broadcast)?

No.

It turns out that this problem is caused by sim_task_failed expecting a type of Union[PointBase, Literal["all"]] for sim_conf['fail cycle points'], but not running the validation method parse_fail_cycle_points after the broadcast.

I wonder if the check parse_fail_cycle_points should be a parsec validator, so that Cylc Broadcast throws out garbage values before broadcasting them. As it is you can broadcast rubbish which will then stop the workflow:

# With Oliver's example workflow
> cylc broadcast left.lemon -n a -p 2 -s '[simulation]fail cycle points = aiusrfhpawiue4h'

then in the log

2024-02-27T09:38:42Z INFO - Broadcast set:
    + [2/a] [simulation]fail cycle points=aiusrfhpawiue4h
...
# workflow keeps going for some time
...
2024-02-27T09:38:59Z ERROR - Workflow shutting down - PointParsingError: Incompatible value for <class 'cylc.flow.cycling.integer.IntegerPoint'>: aiusrfhpawiue4h:
    invalid literal for int() with base 10: 'aiusrfhpawiue4h'

wxtim avatar Feb 26 '24 16:02 wxtim


I tried setting fail try 1 only=True after foo had already retried a few times and got this

Think that I've fixed this.

wxtim avatar Mar 06 '24 09:03 wxtim

Broadcasting [simulation]default run length=PT1S causes the task to succeed 😕

Got it - side effect of earlier change.

wxtim avatar Mar 06 '24 17:03 wxtim

  • Broadcasting execution retry delays does not cause a change when running in simulation mode
  • Also, in the Edit Runtime form, submission retry delays shows as 1 for some reason even though it is not set in flow.cylc

MetRonnie avatar Mar 07 '24 11:03 MetRonnie

  • Broadcasting execution retry delays does not cause a change when running in simulation mode

I cannot replicate this on the command line. Given the context of your comment I tried it from the gui and still couldn't replicate.

In fact some of the tests I've written should fail if that were the case.

Can you come up with a concrete example?

Replicated the submission retry delays thing. Pretty clueless about that though.

wxtim avatar Mar 08 '24 09:03 wxtim

  • Broadcasting execution retry delays does not cause a change when running in simulation mode

I cannot replicate this on the command line. Given the context of your comment I tried it from the gui and still couldn't replicate.

In fact some of the tests I've written should fail if that were the case.

Can you come up with a concrete example?

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2024
    runahead limit = P0
    [[graph]]
        P1M = foo => bar

[runtime]
    [[foo]]
        execution retry delays = 100*PT10S
        script = false
        [[[simulation]]]
            fail cycle points = all
            default run length = PT10S
            fail try 1 only = False
cylc broadcast sim-mode -n foo -s 'execution retry delays=20*PT1S'

But it still keeps saying

WARNING - [20240101T0000Z/foo waiting job:06 flows:1] retrying in PT10S

(The Edit Runtime form shows it correctly but it is not having an effect)

MetRonnie avatar Mar 08 '24 11:03 MetRonnie

@MetRonnie - This is a good spot. I think that the broadcasts probably need to be applied in the TaskJobManager._simulation_submit_task_jobs method to ensure that we are broadcasting all task settings, not just those in [simulation]. I think you might also have found problems clearing broadcasts with things as they were, since I was changing the itask.tdef.rtconfig not a copy of it.

wxtim avatar Mar 11 '24 09:03 wxtim

Conflict resolution looks good

Thank you @hjoliver

wxtim avatar Mar 14 '24 15:03 wxtim

Genuine test failure presumably: tests/f/modes/05-sim-trigger.t

MetRonnie avatar Mar 14 '24 17:03 MetRonnie

I'm looking at the broken test -it's just a log-grep issue after cylc set changes to log messages... I've pushed a fix to the PR branch.

hjoliver avatar Mar 14 '24 22:03 hjoliver