cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Prevent task events manager logging the same message more than once

Open wxtim opened this issue 2 years ago • 26 comments

Closes #5526

Deals with a bug was returning every message ever sent by a task on each poll.

(Draft pending further checks, to ensure that I haven't broken anything)

Check List

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • [x] Tests are included
  • [x] CHANGES.md entry included if this is a change that can affect users
  • [x] ~Update docs~ - this is a change to Cylc's logging.
  • [x] If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

wxtim avatar Jun 01 '23 09:06 wxtim

@metronnie - I think that I fixed it.

wxtim avatar Jun 06 '23 12:06 wxtim

The original approach (adding messages to the activity log file and checking for them later seemed to break some of the workflow restart tests in ways that make me un-confident that they are safe changes.

Scraping the log is horrible.

Next thing to try: using the database - this isn't straightforward, because the database doesn't contain a table which would easily contain the task messages.

wxtim avatar Jun 07 '23 08:06 wxtim

@oliver-sanders - I've left this in a draft state - could I have your thoughts on the issue around restart?

wxtim avatar Jun 15 '23 15:06 wxtim

@oliver-sanders - I've left this in a draft state - could I have your thoughts on the issue around restart?

I don't see what the issue is with restart? The intent of this PR is to prevent task messages being counted more than once. It shouldn't matter whether a restart happened in between two messages.

oliver-sanders avatar Jun 26 '23 13:06 oliver-sanders

@oliver-sanders - I've left this in a draft state - could I have your thoughts on the issue around restart?

I don't see what the issue is with restart? The intent of this PR is to prevent task messages being counted more than once. It shouldn't matter whether a restart happened in between two messages.

No, I agree now, I just forgot to un-draft after I got it working.

wxtim avatar Jun 27 '23 14:06 wxtim

Good news, messages are stored in the DB along with timestamps so we don't need to use the job-activity.log to hold this info.

Just make sure that queued database operations get processed between calls of whatever calls this code (presumably Scheduler.process_queued_task_messages) so that task messages are committed to the DB from the last batch before we process the next batch. This is probably already the case, but well worth checking.

oliver-sanders avatar Jun 28 '23 09:06 oliver-sanders

Last thing is that the started (polled) still piles up in the UI if you poll repeatedly image

Rah!

wxtim avatar Jul 11 '23 07:07 wxtim

For the tests/f/restart/58-removed-task.t test failure you're clear to bodge the test to accept the non-polled message as it isn't trying to test polling logic, it's trying to test task_pool logic.

oliver-sanders avatar Jul 11 '23 13:07 oliver-sanders

I've noticed that the polled messages on restart do not show anymore (screenshot is from master branch). Is that desirable?

That is not desirable, however, it's a different problem - https://github.com/cylc/cylc-flow/issues/5215

The data store needs to load outputs from the database on restart/reload.

oliver-sanders avatar Jul 12 '23 10:07 oliver-sanders

Actually I've just tested this with the UI and unfortunately no messages come through while the workflow is running, but when you stop it, duplicate messages can come through

MetRonnie avatar Jul 25 '23 17:07 MetRonnie

Actually I've just tested this with the UI and unfortunately no messages come through while the workflow is running, but when you stop it, duplicate messages can come through

As it currently is, or with your changes?

wxtim avatar Jul 26 '23 07:07 wxtim

As it is currently. I have been testing with this workflow

[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        script = sleep 15; cylc message -- 'MEOW: Hello world'; sleep 600
  1. Play
  2. After 15 secs "Hello World" chip should show in tree view
  3. Poll a couple of times to see if message is repeated
  4. Stop --now
  5. Restart, message should show up again (once) as polled
  6. Poll again
  7. Stop --kill

MetRonnie avatar Jul 26 '23 10:07 MetRonnie

@MetRonnie

As it currently is.

With all the changes since we started this convo?

wxtim avatar Jul 26 '23 11:07 wxtim

Not since you've rebased (btw if you could avoid rebasing it would make it easier to tell what changes you're making in between my reviews)

MetRonnie avatar Jul 26 '23 12:07 MetRonnie

Also I think you've incorrectly rebased onto master?

MetRonnie avatar Jul 26 '23 12:07 MetRonnie

(needs rebase onto 8.2.x to remove "bump dev version 8.3.0.dev" commit)

oliver-sanders avatar Aug 02 '23 08:08 oliver-sanders

5. Restart, message should show up again (once) as polled

Not happening at the moment

MetRonnie avatar Aug 03 '23 14:08 MetRonnie

  1. Restart, message should show up again (once) as polled

Not happening at the moment

I didn't think we were bothered about how the message was transmitted, so we don't want a second chip if polled.

wxtim avatar Aug 03 '23 15:08 wxtim

But we do want one chip, which is not the case on restart for a ongoing running task?

Also I've noticed this PR does not stop duplicate entries piling up in the DB when polling

$ select * from task_events
foo5|1|2023-08-04T10:43:29+01:00|1|submitted|
foo5|1|2023-08-04T10:43:31+01:00|1|started|
foo5|1|2023-08-04T10:43:46+01:00|1|message meow|Hello world
foo5|1|2023-08-04T10:43:46+01:00|1|message meow|Hello world
foo5|1|2023-08-04T10:43:46+01:00|1|message meow|Hello world

MetRonnie avatar Aug 04 '23 09:08 MetRonnie

But we do want one chip, which is not the case on restart for a ongoing running task?

I've lost track of the example this comment is coming from, but note that this PR will result in some message chips disappearing from the GUI on restart/realod. This isn't desired behaviour, but it is the correct logic!

The true source of the problem is here: https://github.com/cylc/cylc-flow/issues/5215, fixing this issue will restore the chip.

oliver-sanders avatar Aug 10 '23 15:08 oliver-sanders

Ok, other than duplicate entries in task_events DB table, I think this is good to go

MetRonnie avatar Aug 10 '23 15:08 MetRonnie

After discussing with Oliver we think there is still some unnecessary processing of already-received messages going on. I've opened a PR against this at https://github.com/wxtim/cylc/pull/57

Update: closed that as it does not seem feasible without a major refactor

MetRonnie avatar Aug 11 '23 10:08 MetRonnie

Update: closed that as it does not seem feasible without a major refactor

Not entirely surprised, but thank you for attempting it.

wxtim avatar Aug 22 '23 08:08 wxtim

@oliver-sanders are you waiting on me to do something with this?

wxtim avatar Sep 26 '23 16:09 wxtim

Kinda, but I need to take a closer look. I think this isn't skipping enough of the method, the problem is about more than just logging, e.g. it appears to running event handlers and spawning logic for duplicate messages? But I know that when you tried moving the return higher up the function it caused problems.

oliver-sanders avatar Sep 26 '23 17:09 oliver-sanders