Spurious "Not respawning... task was removed" log messages
```
[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2024-01-01
    [[graph]]
        PT3H = get_observations => consolidate_observations
        +PT6H/PT6H = """
            consolidate_observations => forecast
            consolidate_observations[-PT3H] => forecast
            consolidate_observations[-PT6H] => forecast
            get_rainfall => forecast => post_process
        """
        +PT12H/PT6H = """
            forecast[-PT6H] => forecast
        """
```
```
$ cylc play simple-datetime --pause
...
$ cylc stop simple-datetime
...
$ cylc play simple-datetime --pause --no-detach
...
INFO - LOADING task proxies
INFO - + 20240101T0000+0100/get_observations waiting
INFO - [20240101T0000+0100/get_observations:waiting(runahead)] => waiting
INFO - [20240101T0000+0100/get_observations:waiting] => waiting(queued)
INFO - Not respawning 20240101T0300+0100/get_observations - task was removed
INFO - Not respawning 20240101T0600+0100/get_observations - task was removed
INFO - Not respawning 20240101T0900+0100/get_observations - task was removed
INFO - Not respawning 20240101T1200+0100/get_observations - task was removed
INFO - Not respawning 20240101T1500+0100/get_observations - task was removed
INFO - + 20240101T0300+0100/get_observations waiting
INFO - [20240101T0300+0100/get_observations:waiting(runahead)] => waiting
INFO - [20240101T0300+0100/get_observations:waiting] => waiting(queued)
...
```
This log message went from DEBUG to INFO in the group-trigger PR.
https://github.com/cylc/cylc-flow/pull/6395#discussion_r2108816220
I think the issue was punted to https://github.com/cylc/cylc-flow/issues/6813, but the log message wasn't changed by that PR.
I don't think the log message makes sense, especially in this example where `cylc remove` is not even involved. Suggest removing it.
However, this demonstrates that the code comments are bunk:
```python
# If itask has any history in this flow but no completed outputs
# we can infer it has just been deliberately removed (N.B. not
# by `cylc remove`), so don't immediately respawn it.
# TODO (follow-up work):
# - this logic fails if task removed after some outputs completed
LOG.info(f"Not respawning {point}/{name} - task was removed")
```
Which is worrying!
Ping @hjoliver
[UPDATE] I've understood the problem. It is inconsequential apart from these spurious log messages, which appear at restart for non-sequential parentless waiting tasks: the scheduler tries to auto-spawn each one out to the runahead limit again, which was already done prior to shutdown.
The logic is correct for the case described in the code comment, but neglects to account for the situation in this example.
Correct:
- try to spawn a new task proxy
- that task has a recorded history in the DB but no completed outputs
- therefore it must have been previously spawned but "removed" before it could run
- it must have been removed by a suicide trigger via another dependency
- (not by `cylc remove`, which would have erased the prior status)
- therefore we should not respawn it
Neglected:
- ~restart after shutting down with waiting tasks requires spawning waiting tasks like this~
- restart after shutting down with auto-spawned parentless tasks triggers this logic, because each time we resurrect one of these tasks we try to auto-spawn it again out to the runahead limit
- so this code block is possibly preventing a bug (re-adding the same task to the pool multiple times, not sure if that would have bad effects or not) although the log message is misleading
- it does not cause needed tasks to be omitted, because it only comes from attempting to auto-spawn parentless tasks that are going to be resurrected (at restart) from the DB anyway
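A minimal, self-contained sketch of the guard's inference, and why it also fires at restart (this is not the actual cylc-flow internals: the dict-based pool, the `db_history` mapping and the `try_spawn` helper are all hypothetical stand-ins):

```python
# Hypothetical model of the guard: refuse to spawn a task that has DB
# history but no completed outputs, on the assumption it was removed.
def try_spawn(point, name, db_history, pool, log):
    """Attempt to spawn point/name into the task pool."""
    key = (point, name)
    if key in pool:
        return  # already in the pool
    hist = db_history.get(key)
    if hist is not None and not hist["completed_outputs"]:
        # Correct for a suicide-triggered removal, but at restart this
        # also fires for waiting parentless tasks that were auto-spawned
        # before shutdown - hence the spurious messages.
        log.append(f"Not respawning {point}/{name} - task was removed")
        return
    pool[key] = "waiting"

log, pool = [], {}
# 2/foo was waiting (no outputs yet) when the workflow shut down:
db = {(2, "foo"): {"completed_outputs": []}}
try_spawn(2, "foo", db, pool, log)  # refused: history but no outputs
```

The refusal is harmless here only because `2/foo` is resurrected directly from the DB anyway; the problem is purely the misleading message.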
Simpler example:
```
[scheduler]
    allow implicit tasks = True
[scheduling]
    cycling mode = integer
    runahead limit = P0
    [[graph]]
        P1 = foo
```
- `cylc vip --pause --no-detach` - task pool:
  - `1/foo:waiting (queued)`
  - `2/foo:waiting (runahead)`
- `cylc stop` - DB:
  - `1/foo:waiting`
  - `2/foo:waiting`
- `cylc play --pause --no-detach`
  - (a) resurrect `1/foo` from DB, try to autospawn `2/foo` (refuse to do that, log the message)
  - (b) resurrect `2/foo` from DB - good to go
With the default 5 cycle runahead limit we get an ever-decreasing series of these log messages for each resurrected non-sequential waiting parentless task:
- resurrect 1, log 2, 3, 4, 5, 6 (from trying to auto-spawn 2,...,6)
- resurrect 2, log 3, 4, 5, 6
- resurrect 3, log 4, 5, 6
- resurrect 4, log 5, 6
- resurrect 5, log 6
- resurrect 6 - good to go
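The shrinking series above can be modelled in a few lines of Python (a toy count only, assuming points 1-6 are all recorded in the DB at shutdown under the default runahead window):

```python
# Toy model of the spurious-message pattern at restart (assumption: each
# time a waiting parentless task is resurrected from the DB, the scheduler
# tries to auto-spawn every later point out to the runahead limit again).
RUNAHEAD_LAST_POINT = 6  # points 1..6 were already spawned before shutdown

messages = []
for resurrected in range(1, RUNAHEAD_LAST_POINT + 1):
    # Every auto-spawn attempt for a later point is refused and logged.
    for point in range(resurrected + 1, RUNAHEAD_LAST_POINT + 1):
        messages.append(f"Not respawning {point}/foo - task was removed")

# 5 + 4 + 3 + 2 + 1 = 15 spurious messages in total.
print(len(messages))
```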
I'll get a fix up tomorrow.