cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

Spurious "Not respawning... task was removed" log messages

Open MetRonnie opened this issue 4 months ago • 3 comments

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2024-01-01
    [[graph]]
        PT3H = get_observations => consolidate_observations

        +PT6H/PT6H = """
            consolidate_observations => forecast
            consolidate_observations[-PT3H] => forecast
            consolidate_observations[-PT6H] => forecast
            get_rainfall => forecast => post_process
        """

        +PT12H/PT6H = """
            forecast[-PT6H] => forecast
        """
$ cylc play simple-datetime --pause
...
$ cylc stop simple-datetime
...
$ cylc play simple-datetime --pause --no-detach
...
INFO - LOADING task proxies
INFO - + 20240101T0000+0100/get_observations waiting
INFO - [20240101T0000+0100/get_observations:waiting(runahead)] => waiting
INFO - [20240101T0000+0100/get_observations:waiting] => waiting(queued)
INFO - Not respawning 20240101T0300+0100/get_observations - task was removed
INFO - Not respawning 20240101T0600+0100/get_observations - task was removed
INFO - Not respawning 20240101T0900+0100/get_observations - task was removed
INFO - Not respawning 20240101T1200+0100/get_observations - task was removed
INFO - Not respawning 20240101T1500+0100/get_observations - task was removed
INFO - + 20240101T0300+0100/get_observations waiting
INFO - [20240101T0300+0100/get_observations:waiting(runahead)] => waiting
INFO - [20240101T0300+0100/get_observations:waiting] => waiting(queued)
...

MetRonnie avatar Oct 14 '25 16:10 MetRonnie

This log message went from DEBUG to INFO in the group-trigger PR.

https://github.com/cylc/cylc-flow/pull/6395#discussion_r2108816220

I think the issue was punted to https://github.com/cylc/cylc-flow/issues/6813, but the log message wasn't changed by that PR.

I don't think the log message makes sense, especially in this example where remove is not even involved. Suggest removing it.

However, this demonstrates that the code comments are bunk:

            # If itask has any history in this flow but no completed outputs
            # we can infer it has just been deliberately removed (N.B. not
            # by `cylc remove`), so don't immediately respawn it.
            # TODO (follow-up work):
            # - this logic fails if task removed after some outputs completed
            LOG.info(f"Not respawning {point}/{name} - task was removed")

Which is worrying!

Ping @hjoliver

oliver-sanders avatar Oct 15 '25 14:10 oliver-sanders

[UPDATE] I've understood the problem. It is inconsequential apart from these spurious log messages for non-sequential parentless waiting tasks at restart, due to trying to auto-spawn each one out to the runahead limit again - that was already done prior to shutdown.


The logic is correct for the case described in the code comment, but neglects to account for the situation in this example.

Correct:

  • try to spawn a new task proxy
  • that task has a recorded history in the DB but no completed outputs
  • therefore it must have been previously spawned but "removed" before it could run
  • it must have been removed by suicide trigger via another dependency
    • (not by cylc remove which would have erased the prior status)
  • therefore we should not respawn it

Neglected:

  • ~restart after shutting down with waiting tasks requires spawning waiting tasks like this~
  • restart after shutting down with auto-spawned parentless tasks triggers this logic, because each time we resurrect one of these tasks we try to auto-spawn it again out to the runahead limit
    • so this code block is possibly preventing a bug (re-adding the same task to the pool multiple times, not sure if that would have bad effects or not) although the log message is misleading
    • it does not cause needed tasks to be omitted, because it only comes from attempting to auto-spawn parentless tasks that are going to be resurrected (at restart) from the DB anyway

Simpler example:

[scheduler]
    allow implicit tasks = True
[scheduling]
    cycling mode = integer
    runahead limit = P0
    [[graph]]
        P1 = foo
  1. cylc vip --pause --no-detach - task pool:
1/foo:waiting (queued)
2/foo:waiting (runahead)
  1. cylc stop - DB:
1/foo:waiting
2/foo:waiting
  1. `cylc play --pause --no-detach
  • (a) resurrect 1/foo from DB, try to autospawn 2/foo (refuse to do that, log the message)
  • (b) resurrect '2/foo` from DB - good to go

With the default 5 cycle runhead limit we get an ever decreasing series of these log messages for each resurrected non-sequential waiting parentless task:

  • resurrect 1, log 2, 3, 4, 5, 6 (from trying to auto-spawn 2,...,6)
  • resurrect 2, log 3, 4, 5, 6
  • resurrect 3, log 4, 5, 6
  • resurrect 4, log 5, 6
  • resurrect 5, log 6
  • resurrect 6 - good to go

hjoliver avatar Oct 17 '25 02:10 hjoliver

I'll get a fix up tomorrow.

hjoliver avatar Oct 19 '25 23:10 hjoliver