cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Implement "cylc set" command

Open hjoliver opened this issue 1 year ago • 103 comments

Implement accepted proposal A New Command for Manual Task Forcing

  • Close #5642
  • Close #5638
  • Close #5572
  • Close #5882

Supersede #5412 (fix expire triggers)

  • Close #5364
  • Close #5354
  • Close #5447
  • Close #4527

TODO

  • [x] Rename set-outputs to set
  • [x] Update the CLI and GraphQL schema
  • [x] Custom outputs: prohibit "all", "required", "optional" keywords, "," character, and "_cylc" prefix.
  • [x] Reproduce old set-outputs behaviour with new command options and query
  • [x] Implement the new logic for setting prerequisites [FIRST CUT done]
  • [x] Implement the new logic for setting outputs
  • [x] Test new schema works as expected in the UI
  • [x] Tests for all new functionality
  • [x] Update documentation -> Punted to https://github.com/cylc/cylc-doc/issues/688
  • [x] This is a breaking change, add an entry into the cylc-doc changes page -> Punted to https://github.com/cylc/cylc-doc/issues/688

Side effects:

  • rename cylc set-verbosity to cylc verbosity to avoid command ambiguity apocalypse
  • command UUIDs
  • cleaner log message prefix for tasks

Follow-up:

  • cylc set --pre=cycle

PR Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • [x] Tests are included (or explain why tests are not needed).
  • [x] CHANGES.md entry included if this is a change that can affect users

hjoliver avatar Jul 31 '23 05:07 hjoliver

Rebased. Tests fixed. Will finish this off ASAP.

hjoliver avatar Sep 04 '23 07:09 hjoliver

Unrelated (?) mypy error from the Py 3.10 and 3.11 fast tests:

/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/matplotlib/pyplot.py:2700:
 error: Positional-only parameters are only supported in Python 3.8 and greater  [syntax]
Found 1 error in 1 file (errors prevented further checking)

[update: fixed by #5735]

hjoliver avatar Sep 19 '23 01:09 hjoliver

@oliver-sanders - I just took a quick look at your "Extension" part of the proposal:

When setting prereqs/outputs it would be useful to provide visual confirmation of the result. We could display the cylc show output (which can be generated Scheduler side without the requirement for a second API call).

Correct me if I'm wrong, but to do this we need to take the set call out of the asynchronous command queue in the scheduler, and instead call it directly via a wrapper method in the resolver.

The trouble is, the set call updates task DB tables, and I get this:

self.conn.executemany(stmt, stmt_args_list)
    sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. 

hjoliver avatar Sep 25 '23 04:09 hjoliver

@oliver-sanders - this now incorporates (and supersedes) #5412 because I needed to handle (and test consequences of, such as triggering) set to expired. So there will be a bit of overlap and conflict with #5651, but hopefully not too much.

hjoliver avatar Sep 26 '23 02:09 hjoliver

@oliver-sanders - I think there's a small error in the proposal example 5

$ cylc set //foo --out=a,succeeded --wait

The database will record:

foo: state=waiting, outputs={a}

The DB should record state=succeeded and outputs={a, submitted, started, succeeded} because:

  • completed outputs affect task status
  • we set succeeded, and it has implied outputs

Agree?

hjoliver avatar Sep 26 '23 12:09 hjoliver

The DB should record state=succeeded and outputs={a, submitted, started, succeeded} because:

  • completed outputs affect task status
  • we set succeeded, and it has implied outputs

Agreed, the example said --out=a,succeeded so the missing submitted, started, succeeded were definitely an omission.

oliver-sanders avatar Sep 26 '23 15:09 oliver-sanders

Correct me if I'm wrong, but to do this we need to take the set call out of the asynchronous command queue in the scheduler, and instead call it directly via a wrapper method in the resolver.

The trouble is, the set call updates task DB tables, and I get this:

I said it would be nice, I didn't say it would be easy! Outside of brain cache at this point :(

The resolvers can see the store, and the store should have all the required prerequisite info in it, so we should be able to read from the store rather than the DB for the cylc show output. So the remaining issue is about knowing when the "set" operation has completed which is the same issue as #3329.

oliver-sanders avatar Sep 26 '23 15:09 oliver-sanders

Undrafting this now as I think it implements the proposal fully.

I've added functional tests that reproduce the examples in the proposal (IMO functional tests are needed for this sort of thing, although I'll consider other tests too).

From a quick look I'd say the patch coverage seems to be wrong, maybe because the remote tests are failing (unrelated).

Still to do:

  • more tidying?
  • more tests?
  • test via the UI
  • docs

hjoliver avatar Sep 27 '23 03:09 hjoliver

From a quick look I'd say the patch coverage seems to be wrong, maybe because the remote tests are failing (unrelated).

Caused by an external issue, now fixed in the upstream repo.

IMO functional tests are needed for this sort of thing, although I'll consider other tests too

IMO functional tests are absolutely necessary to ensure the end-to-end functionality and we definitely need some. However, I don't think we should attempt to test every possible option in this way, only examples of the general functionality, because functional tests are hard to maintain, slow to run and difficult to debug.

Note, the integration testing framework runs the scheduler, the whole scheduler and nothing but the scheduler, the only thing it's really cutting out is the CLI. You can do everything, up to and including reference tests with it if so inclined:

async with run(schd) as log:
    # run the workflow with a timeout of 60 seconds
    await asyncio.sleep(60)
assert reftest(log) == '''
1/b triggered off [1/a]
1/c triggered off [1/b]
'''

Although that's really a just functional test implemented in Python.

For a more integration'ey approach to reftests we can do something like this which is essentially just another way of getting the "triggered off" information without having to run the main loop and bring race conditions into play.

async with start(schd):
    assert set(schd.pool.get_tasks()) == {'1/a'}

    # setting a:succeeded should spawn b
    schd.command_reset('1/a', 'succeeded')
    assert set(schd.pool.get_tasks()) == {'1/b'}
    
    # setting b:x should spawn c
    schd.command_reset('1/b', 'x')
    assert set(schd.pool.get_tasks()) == {'1/b', '1/c'}

oliver-sanders avatar Sep 27 '23 11:09 oliver-sanders

(Almost there...)

hjoliver avatar Oct 30 '23 07:10 hjoliver

CLI auto-completion for the --pre and --out arguments: https://github.com/hjoliver/cylc-flow/pull/34

Can add set --out=succeeded an an option to cylc tui after the current PR has been merged.

oliver-sanders avatar Nov 14 '23 15:11 oliver-sanders

(The only small hold-up here is: I removed the functional tests, and haven't finished replacing them with integration tests ... probably should have done that as a follow-up! - but during that process made some other tidy-up changes that broke a few other tests ... almost done).

hjoliver avatar Nov 15 '23 02:11 hjoliver

It might be worth inferring :succeeded to match graph behaviour?

Yes, done.

hjoliver avatar Nov 15 '23 02:11 hjoliver

Still not done adding tests. This afternoon I got stuck on the failing event-mail functional tests. They are failing for me locally too, but possibly for different reasons ... can someone with 5 mins spare run them locally and compare master with this branch?

(And now I've broken one integration test, but it's a trivial fix...)

hjoliver avatar Nov 15 '23 05:11 hjoliver

Sorry, forgot to mention, the mail event tests got broken on master (mb, I merged something before it was ready), Tim's working on a fix.

oliver-sanders avatar Nov 15 '23 09:11 oliver-sanders

This illogical command had a curious outcome: cylc set --out=succeeded,failed

That's not illogical. A single running job can't both succeed and fail, but a task can fail and then succeed on retry, in which case the DB task_outputs table records both outputs.

For cylc set, maybe we want to test both success and fail branches at once...

So it seems to me we should leave that as-is.

hjoliver avatar Nov 16 '23 05:11 hjoliver

@oliver-sanders - I think I've address all your comments, and all the tests (except the usual MacOS ones) are passing now. I just need to add more integration tests tomorrow (having removed my new functional tests, at least for the moment; if I run out of time tomorrow I'll put those back in and do a follow-up PR for the integration tests, so we can get moving on this ... I'm going on leave for a week, next week)

hjoliver avatar Nov 16 '23 10:11 hjoliver

That's not illogical. A single running job can't both succeed and fail

Fair enough, my confusion was more that running cylc set --out=succeeded,failed resulted in the task being left in the succeeded state irrespective of what order you specify succeeded/failed in.

I've tested it with a branched workflow and confirmed that both pathways are being run, what state the task goes into is an implementation choice, I guess the reason for this is probably the sort order of succeeded/failed?

oliver-sanders avatar Nov 16 '23 11:11 oliver-sanders

Pretty close, works nicely for most cases. Going through the use cases in the proposal I've spotted a few issues...

1. I don't think "set" is working with globs?

Glob requirement from use case (3).

[scheduling]
    [[graph]]
        R1 = a
[runtime]
    [[a]]
        script = false
$ cylc dump baz -t
a, 1, failed, not-held, not-queued, not-runahead
$ cylc set 'baz//*' --out=succeeded
Command submitted; the scheduler will log any problems.
$ cylc dump baz -t
a, 1, failed, not-held, not-queued, not-runahead
$ cylc cat baz | tail -n 7
2023-11-16T11:22:07Z WARNING - No matching tasks found: */* (Incompatible value for <class 'cylc.flow.cycling.integer.IntegerPoint'>: *: invalid literal for int() with base 10: '*')
2023-11-16T11:22:07Z INFO - Command "set" actioned: id=82cd8248-e235-4f8d-8cd9-528e9d72d257
2023-11-16T11:22:07Z WARNING - stall timer stopped
2023-11-16T11:22:07Z ERROR - Incomplete tasks:
      * 1/a did not complete required outputs: ['succeeded']
2023-11-16T11:22:07Z CRITICAL - Workflow stalled
2023-11-16T11:22:07Z WARNING - PT1H stall timer starts NOW

Note the error:

Incompatible value for <class 'cylc.flow.cycling.iso8601.ISO8601Point'>

2. Workflow can prematurely shut down with --pre=succeeded.

From use case (5)

This workflow will run indefinitely:

[scheduler]
    allow implicit tasks = True

[scheduling]
    initial cycle point = 1
    cycling mode = integer
    runahead limit = P0
    [[graph]]
        P1 = """
            a:x? => x
            a:y? => y
            x | y => z
        """

[runtime]
    [[a]]
        script = cylc message -- x
        [[[outputs]]]
            x = x
            y = y

But jam this in and it will shut down after running 2/z because the spawning of the parentless task a is interrupted:

    [[z]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT -eq 1 ]]; then
                cylc set "${CYLC_WORKFLOW_ID}//2/a" --out=y,succeeded
            fi
        """

3. Incomplete tasks are not run.

Starting with the same example as (2) above but jamming in this:

    [[z]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT -eq 1 ]]; then
                cylc set "${CYLC_WORKFLOW_ID}//2/a" --out=y
            fi
        """

We get really strange behaviour.

The cylc set command, sets the output y, but not succeeded or failed so the task is both not-finished and not-complete. Because the task is incomplete, the approaching flow should run it. And according to the log, Tui, dump, etc it does:

INFO - [2/a running(runahead) job:00 flows:1] => running

But in reality it doesn't:

$ ls ~/cylc-run/future/log/job/2
y  z

And the workflow hangs with a non-existent running task?!

4. Expired task state.

A couple of minor issues with expired tasks.

[scheduler]
    allow implicit tasks = True
    cycle point format = CCYY
[scheduling]
    initial cycle point = 2020
    [[special tasks]]
        clock-expire = a
    [[graph]]
        P1Y = """
            z[-P1Y] => a
            a:expire? => e
            a:succeed => s
            e | s => z
        """

The task a expires correctly and the workflow runs as expected (the important part).

However, the expired task retains the queued state (logs, tui, dump, etc) and is logged as incomplete:

2023-11-16T13:28:41Z WARNING - [2023/a expired(queued) job:00 flows:1] did not complete required outputs: ['submitted', 'succeeded']

I think we should set is_queued=False with itask.state_reset(TASK_STATUS_EXPIRED) and skip the incomplete logging.

5. Workflow hangs after spawning parentless task.

From use case (7)

[scheduler]
    allow implicit tasks = True
    cycle point format = CCYY
[scheduling]
    initial cycle point = now
    [[graph]]
        P1Y = @wall_clock => a => b
[runtime]
    [[a]]
        script = """
            # skip to cycle 2030 by stopping the flow here
            cylc stop --flow=1 "${CYLC_WORKFLOW_ID}"
            # then spawning the parentless task a in cycle 2030 (use case 7)
            cylc set --pre=all "${CYLC_WORKFLOW_ID}//2030/a"
        """

This example almost works as expected:

INFO - [2023/a running job:01 flows:none] => succeeded
INFO - [2030/a waiting(runahead) job:00 flows:1] => waiting
INFO - [2031/a waiting(runahead) job:00 flows:1] => waiting
INFO - [2032/a waiting(runahead) job:00 flows:1] => waiting
INFO - [2033/a waiting(runahead) job:00 flows:1] => waiting
INFO - [2034/a waiting(runahead) job:00 flows:1] => waiting

But strangely, none of those waiting tasks ever get run and the workflow hangs?!

The runahead limit appears to get updated correctly, not sure what the cause is?

6. Workflow stops prematurely when spawning parentless task.

Same example as before, but running the commands manually rather than automatically in [runtime][a]script.

$ cylc pause orphan
$ cylc stop --flow=1 orphan
$ cylc set --pre=all orphan//2030/a
$ cylc play orphan//
INFO - Command "set" received: id=c443704e-c534-4421-be23-155441823580
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=['all'], tasks=['2030/a'])
INFO - Command "set" actioned: id=c443704e-c534-4421-be23-155441823580
INFO - Command "resume" received: id=a44a9163-c0fb-4f87-b590-f65efc435639
INFO - RESUMING the workflow now
INFO - Command "resume" actioned: id=a44a9163-c0fb-4f87-b590-f65efc435639
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

oliver-sanders avatar Nov 16 '23 13:11 oliver-sanders

I've been testing with the Tui branch merged in, this patch is handy:

diff --git a/cylc/flow/tui/data.py b/cylc/flow/tui/data.py
index 04c3d7220..0feb6b8ae 100644
--- a/cylc/flow/tui/data.py
+++ b/cylc/flow/tui/data.py
@@ -107,6 +107,7 @@ MUTATIONS = {
     'task': [
         'hold',
         'release',
+        'set',
         'kill',
         'trigger',
         'poll',

oliver-sanders avatar Nov 16 '23 14:11 oliver-sanders

  1. I don't think "set" is working with globs?

It works with task name globs (cylc set --out=succeeded "baz//1/*"); I guess I hadn't tested with //*.

[UPDATE]: actually I had forgotten, but this was deliberate - because (a) "set" has to be able to target any task, and (b) we have open task globbing issues still under discussion, It isn't trivial to match both in and out of the task pool within the same command invocation when the command affects pool content (set-outputs can spawn new tasks into the pool).

So currently on this branch you can't use cycle point globs, but that use case works fine with specific cycle points. (Mind you, cycle point globbing is considerably less important than task name globbing anyway). Matching by state selector also remains a TODO on this branch, since that's task-pool specific.

On the weekend, before going away Monday, I'll do my best get this branch ready and explain exactly what needs doing as a follow-up.

hjoliver avatar Nov 16 '23 21:11 hjoliver

Although, nice work on the stress-testing, looks like I have my work cut out for me! Unfortunately I blew my time today looking at the globbing issues.

hjoliver avatar Nov 17 '23 09:11 hjoliver

[UPDATE]: actually I had forgotten, but this was deliberate

So that means we can't cylc set '*' --out=succeeded?

Can we stick to task pool globbing for now until we've flushed out a way forward, having different rules for different commands is too confusing.

oliver-sanders avatar Nov 17 '23 10:11 oliver-sanders

[UPDATE]: actually I had forgotten, but this was deliberate

So that means we can't cylc set '*' --out=succeeded?

Yes, but we can do cylc set 1/* --out=succeeded, which is pretty easy.

Can we stick to task pool globbing for now until we've flushed out a way forward, having different rules for different commands is too confusing.

OK, I'll revert for now. [Actually, not "revert" - see next comment below]. But note:

  • the fact that globbing currently only works in the pool already amounts to a form of "different rules for different commands" and has already caused problems for users
  • different commands necessarily act on different populations of tasks, which has implications for the underlying matching rules (although some of those differences can be hidden from users)

So we really need to sort this out pretty quickly. I'll bang up a new "task selection" issue to collate and supersede the current partial ones... #5827

hjoliver avatar Nov 18 '23 22:11 hjoliver

OK, I'll revert for now.

Actually set-outputs on master does not do pool-only globbing. It globs universally on task names, but doesn't allow cycle point globs. So I had not actually changed it on this branch.

I will change it to do that now, however, pending resolution of #5827

hjoliver avatar Nov 18 '23 22:11 hjoliver

(Reverting to DRAFT until I've addressed all review points - as several are non-trivial)

hjoliver avatar Dec 01 '23 02:12 hjoliver

New side-effect noted in the description: - cleaner log message prefix for tasks.

Old:

[1/foo submitted job:01 flows:1] => running 
[1/bar submitted job:03 flows:1,2] => running

New:

[1/foo/01:submitted] => running
[1/foo/01(1,2):submitted] => running

I should probably have flagged this first, but:

  • the job submit number is now technically part of the ID so that part should not be controversial
  • @oliver-sanders already expressed a preference some time ago for ommiting flow 1, so that users who don't use flows don't need to think about it
  • changing how the flow numbers are displayed is trivial (albeit with a bunch of knock-on log-grep affects in the tests)

hjoliver avatar Dec 01 '23 02:12 hjoliver

After extensive changes and rebasing onto master, I've got the tests passing again, except for tests/integration/tui/test_app.py::test_auto_expansion which I think might indicate a missed datastore update on this branch.

Of the extra stuff, from @oliver-sanders above:

  • [x] 1. I don't think "set" is working with globs?
  • [x] 2. Workflow can prematurely shut down with --pre=succeeded
  • [x] 3. Incomplete tasks are not run.
  • [x] 4. Expired task state.
  • [x] ~~5. Workflow hangs after spawning parentless task.~~ [INVALID - see below]
  • [x] 6. Workflow stops prematurely when spawning parentless task
  1. Workflow hangs after spawning parentless task ... But strangely, none of those waiting tasks ever get run and the workflow hangs?! The runahead limit appears to get updated correctly, not sure what the cause is?

User error - the tasks are spawned way beyond their wall_clock triggers!

hjoliver avatar Dec 06 '23 02:12 hjoliver

Current status of this branch: I believe it implements the cylc set proposal, and it addresses all the issues raised in review by Oliver.

If so, and if no further problems arise in testing, then the only remaining problem is that a bunch of new functional tests should be converted to maximally simply "canonical" example functional tests, with coverage achieved independently (of the canonical tests) by unit and integration tests.

I could attempt to do that on this PR branch, or - given the urgency of getting 8.3.0 done - I could do the test changes as a follow-up. What's your preference @oliver-sanders ?

hjoliver avatar Dec 08 '23 11:12 hjoliver

Damn, what a mission. @oliver-sanders - all comments addressed and tests passing now.

I still need to take a closer look at coverage and shift some tests out of functional, but realistically that might take a bit of work. How do you feel about me doing that as a follow-up, so you can start building your part on top of this?

hjoliver avatar Dec 15 '23 03:12 hjoliver