`cylc stop --now` can wait for several minutes for job logs retrieval to take place in some cases
Description
When stopping a couple of big NWP workflows that had been running for ~ a dozen days, they took several minutes to actually shut down after the cylc stop --now command was received.
In the example below, the timing of the job logs retrieve command returning followed by the logging of "Workflow shutting down" is suspicious.
Logs
2025-10-06T14:34:04Z INFO - [20251006T0900Z/enuk_um_fcst_013/01:running] => succeeded
2025-10-06T14:34:04Z INFO - [20251006T2100Z/enuk_um_fcst_013:waiting(runahead)] => waiting
2025-10-06T14:34:14Z INFO - Command "stop" received. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
stop(mode=REQUEST(NOW))
2025-10-06T14:34:14Z INFO - Command "stop" actioned. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_jules_1/01:running] => succeeded
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_ekf:waiting(runahead)] => waiting
2025-10-06T14:34:49Z INFO - [20251006T1100Z/ukv_surf_jules_0/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_014/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_end:waiting] => waiting(queued)
2025-10-06T14:34:49Z INFO - [20251006T2100Z/enuk_um_fcst_014:waiting(runahead)] => waiting
2025-10-06T14:36:52Z INFO - Command "set_verbosity" received. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
set_verbosity(level=10)
2025-10-06T14:36:53Z DEBUG - Processing 1 queued command(s)
2025-10-06T14:36:53Z INFO - Command "set_verbosity" actioned. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
2025-10-06T14:37:37Z DEBUG - P30D inactivity timer starts NOW
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] (received)succeeded
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] => succeeded
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] Queueing succeeded handler: os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] added to the n=0 window
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] => waiting
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting] => waiting(queued)
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_user_hook:waiting] => waiting(queued)
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] removed from the n=0 window: completed
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - 20251006T1000Z/ukv_ops_process_analysis_radarz/01 handler:job-logs-retrieve for task event:succeeded will run after PT10S (after 2025-10-06T14:37:47Z)
2025-10-06T14:37:38Z DEBUG - os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:38Z DEBUG - [(('event-handler-00', 'succeeded'), 1) cmd] os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
[(('event-handler-00', 'succeeded'), 1) ret_code] 0
[(('event-handler-00', 'succeeded'), 1) out] [INFO] 4 arguments - task event
2025-10-06T14:37:50Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T1000Z/ukv_ops_process_analysis_radarz ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
[TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
[TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
receiving incremental file list
20251006T1000Z/ukv_ops_process_analysis_radarz/
20251006T1000Z/ukv_ops_process_analysis_radarz/01/
20251006T1000Z/ukv_ops_process_analysis_radarz/01/job
20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.err
20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.out
20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.stats
20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.status
sent 356 bytes received 247,288 bytes 165,096.00 bytes/sec
2025-10-06T14:37:55Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T0900Z/enuk_um_fcst_013 ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
[TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
[TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
receiving incremental file list
20251006T0900Z/enuk_um_fcst_013/01/
20251006T0900Z/enuk_um_fcst_013/01/job.err
20251006T0900Z/enuk_um_fcst_013/01/job.out
sent 221 bytes received 16,410,397 bytes 10,940,412.00 bytes/sec
2025-10-06T14:37:55Z INFO - Workflow shutting down - REQUEST(NOW)
2025-10-06T14:37:55Z WARNING - Orphaned tasks:
* 20251006T1000Z/enuk_um_fcst_016 (running)
* 20251006T1000Z/enuk_um_fcst_017 (running)
* 20251006T1000Z/enuk_um_fcst_015 (running)
* 20251006T1100Z/ukv_ops_process_background_radarz (running)
2025-10-06T14:37:55Z DEBUG - stopping zmq replier...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - stopping zmq publisher...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - auth received API command b'TERMINATE'
2025-10-06T14:37:55Z INFO - platform: <platform-login-node> - remote tidy (on <platform-login-node>)
2025-10-06T14:37:55Z INFO - platform: excd - remote tidy (on <platform-login-node>)
2025-10-06T14:37:57Z DEBUG - Removing authentication keys from scheduler
2025-10-06T14:37:57Z DEBUG - os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
2025-10-06T14:37:58Z INFO - [('workflow-event-handler-00', 'shutdown') cmd] os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
[('workflow-event-handler-00', 'shutdown') ret_code] 0
[('workflow-event-handler-00', 'shutdown') out] [INFO] 3 arguments - workflow event
2025-10-06T14:37:58Z INFO - DONE
Reproducible Example
Edit: I've managed to come up with a repro:
# global.cylc
[platforms]
[[login.remote.hpc]]
retrieve job logs = True
retrieve job logs retry delays = PT40S
# flow.cylc
[scheduling]
[[graph]]
R1 = foo => bar
[runtime]
[[foo]]
platform = login.remote.hpc
[[bar]]
script = sleep 60
[[[events]]]
started handlers = cylc stop %(workflow)s --now
There is an issue here where handlers that are due to be run in the future hold up the shutdown. Whereas I think the --now option x1 should only wait for already-running handlers to finish.
I don't think this is a bug.
Use the --now option once and it will wait for various things.
Use it twice for a more immediate shutdown. I think this will probs skip retrieval.
CLI docs:
-n, --now Shut down without waiting for active tasks to
complete. If this option is specified once, wait for
task event handler, job poll/kill to complete. If this
option is specified more than once, tell the workflow
to terminate immediately.
GUI docs:
--now
Immediate shutdown
* Don't kill submitted or running jobs.
* Run workflow event handlers and wait for them to complete.
--now --now
mmediate shutdown
* Don't kill submitted or running jobs.
* Don't run event handlers.
Neither set of docs mentions job log retrial, however, I assume this goes with event handlers?
Job log retrieval seems to be a type of event handler, but what I think is a bug is that it waits for all pending job log retrievals that are scheduled to happen rather than just ones that are in progress. The ones that are scheduled to happen but don't due to the use of --now --now will just happen on restart.
My expectation is that --now --now is only for if you have long-running event handlers that you want to cancel (not future-scheduled event handlers).
We think this is working as intended / documented.
But we can still change it of course.
It's clearer the way it is, "run event handlers and wait for them to complete", treating job log handlers differently to say, workflow shutdown handlers (which is also a "future" event, not in progress at the time) would be very odd. But I don't think the difference will matter to many people, so not too fussed.