cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

`cylc stop --now` can wait for several minutes for job logs retrieval to take place in some cases

Open MetRonnie opened this issue 4 months ago • 3 comments

Description

When stopping a couple of big NWP workflows that had been running for ~ a dozen days, they took several minutes to actually shut down after the cylc stop --now command was received.

In the example below, the timing of the job logs retrieve command returning followed by the logging of "Workflow shutting down" is suspicious.

Logs

2025-10-06T14:34:04Z INFO - [20251006T0900Z/enuk_um_fcst_013/01:running] => succeeded
2025-10-06T14:34:04Z INFO - [20251006T2100Z/enuk_um_fcst_013:waiting(runahead)] => waiting
2025-10-06T14:34:14Z INFO - Command "stop" received. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
    stop(mode=REQUEST(NOW))
2025-10-06T14:34:14Z INFO - Command "stop" actioned. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_jules_1/01:running] => succeeded
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_ekf:waiting(runahead)] => waiting
2025-10-06T14:34:49Z INFO - [20251006T1100Z/ukv_surf_jules_0/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_014/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_end:waiting] => waiting(queued)
2025-10-06T14:34:49Z INFO - [20251006T2100Z/enuk_um_fcst_014:waiting(runahead)] => waiting
2025-10-06T14:36:52Z INFO - Command "set_verbosity" received. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
    set_verbosity(level=10)
2025-10-06T14:36:53Z DEBUG - Processing 1 queued command(s)
2025-10-06T14:36:53Z INFO - Command "set_verbosity" actioned. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
2025-10-06T14:37:37Z DEBUG - P30D inactivity timer starts NOW
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] (received)succeeded
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] => succeeded
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] Queueing succeeded handler: os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] added to the n=0 window
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] => waiting
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting] => waiting(queued)
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_user_hook:waiting] => waiting(queued)
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] removed from the n=0 window: completed
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - 20251006T1000Z/ukv_ops_process_analysis_radarz/01 handler:job-logs-retrieve for task event:succeeded will run after PT10S (after 2025-10-06T14:37:47Z)
2025-10-06T14:37:38Z DEBUG - os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:38Z DEBUG - [(('event-handler-00', 'succeeded'), 1) cmd] os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
    [(('event-handler-00', 'succeeded'), 1) ret_code] 0
    [(('event-handler-00', 'succeeded'), 1) out] [INFO] 4 arguments - task event
2025-10-06T14:37:50Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T1000Z/ukv_ops_process_analysis_radarz ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
    receiving incremental file list
    20251006T1000Z/ukv_ops_process_analysis_radarz/
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.err
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.out
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.stats
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.status
    sent 356 bytes  received 247,288 bytes  165,096.00 bytes/sec
2025-10-06T14:37:55Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T0900Z/enuk_um_fcst_013 ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
    receiving incremental file list
    20251006T0900Z/enuk_um_fcst_013/01/
    20251006T0900Z/enuk_um_fcst_013/01/job.err
    20251006T0900Z/enuk_um_fcst_013/01/job.out
    sent 221 bytes  received 16,410,397 bytes  10,940,412.00 bytes/sec
2025-10-06T14:37:55Z INFO - Workflow shutting down - REQUEST(NOW)
2025-10-06T14:37:55Z WARNING - Orphaned tasks:
    * 20251006T1000Z/enuk_um_fcst_016 (running)
    * 20251006T1000Z/enuk_um_fcst_017 (running)
    * 20251006T1000Z/enuk_um_fcst_015 (running)
    * 20251006T1100Z/ukv_ops_process_background_radarz (running)
2025-10-06T14:37:55Z DEBUG - stopping zmq replier...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - stopping zmq publisher...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - auth received API command b'TERMINATE'
2025-10-06T14:37:55Z INFO - platform: <platform-login-node> - remote tidy (on <platform-login-node>)
2025-10-06T14:37:55Z INFO - platform: excd - remote tidy (on <platform-login-node>)
2025-10-06T14:37:57Z DEBUG - Removing authentication keys from scheduler
2025-10-06T14:37:57Z DEBUG - os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
2025-10-06T14:37:58Z INFO - [('workflow-event-handler-00', 'shutdown') cmd] os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
    [('workflow-event-handler-00', 'shutdown') ret_code] 0
    [('workflow-event-handler-00', 'shutdown') out] [INFO] 3 arguments - workflow event
2025-10-06T14:37:58Z INFO - DONE

Reproducible Example

Edit: I've managed to come up with a repro:

# global.cylc
[platforms]
    [[login.remote.hpc]]
        retrieve job logs = True
        retrieve job logs retry delays = PT40S
# flow.cylc
[scheduling]
    [[graph]]
        R1 = foo => bar

[runtime]
    [[foo]]
        platform = login.remote.hpc
    [[bar]]
        script = sleep 60
        [[[events]]]
            started handlers = cylc stop %(workflow)s --now

There is an issue here where handlers that are due to be run in the future hold up the shutdown. Whereas I think the --now option x1 should only wait for already-running handlers to finish.

MetRonnie avatar Oct 06 '25 16:10 MetRonnie

I don't think this is a bug.

Use the --now option once and it will wait for various things.

Use it twice for a more immediate shutdown. I think this will probs skip retrieval.


CLI docs:

  -n, --now             Shut down without waiting for active tasks to
                        complete. If this option is specified once, wait for
                        task event handler, job poll/kill to complete. If this
                        option is specified more than once, tell the workflow
                        to terminate immediately.

GUI docs:

--now

Immediate shutdown
 * Don't kill submitted or running jobs.
 *  Run workflow event handlers and wait for them to complete.
--now --now

mmediate shutdown
 * Don't kill submitted or running jobs.
 * Don't run event handlers.

Neither set of docs mentions job log retrial, however, I assume this goes with event handlers?

oliver-sanders avatar Oct 07 '25 09:10 oliver-sanders

Job log retrieval seems to be a type of event handler, but what I think is a bug is that it waits for all pending job log retrievals that are scheduled to happen rather than just ones that are in progress. The ones that are scheduled to happen but don't due to the use of --now --now will just happen on restart.

My expectation is that --now --now is only for if you have long-running event handlers that you want to cancel (not future-scheduled event handlers).

MetRonnie avatar Oct 08 '25 09:10 MetRonnie

We think this is working as intended / documented.

But we can still change it of course.

It's clearer the way it is, "run event handlers and wait for them to complete", treating job log handlers differently to say, workflow shutdown handlers (which is also a "future" event, not in progress at the time) would be very odd. But I don't think the difference will matter to many people, so not too fussed.

oliver-sanders avatar Oct 15 '25 13:10 oliver-sanders