flux-core flux.service: State 'stop-sigterm' timed out. Killing.

Open garlick opened this issue 3 years ago • 2 comments

Problem: running systemctl stop flux on rank 0 while the throughput test was running on 5 compute nodes timed out, and systemd killed the rank 0 broker.

The content store backend had a backlog of 10-20K blobs the last time I checked, with content.sqlite on an NFS file system.

journalctl -u flux shows:

Feb 24 10:34:11 picl0 systemd[1]: flux.service: State 'stop-sigterm' timed out. Killing.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1640 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 46988 (rc3) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 47002 (flux) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1642 (ZMQbg/IO/0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1643 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1653 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Main process exited, code=killed, status=9/KILL
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Failed with result 'timeout'.
Feb 24 10:34:12 picl0 systemd[1]: Stopped Flux message broker.
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Consumed 6h 40min 49.839s CPU time.

Flux was apparently not restarted automatically.

On restart, Flux entered safe mode.

Feb 24 '22 18:02 garlick

Full chatty log attached.

picl0-crash-log.txt

Feb 24 '22 18:02 garlick

Couple of good comments from @grondo on slack, captured before they scroll off into oblivion:

There is a TimeoutStopSec directive in systemd.service
"infinity" is a valid value
I wonder if we should look into adding support for sd_notify(3) at some point
a service can request an extended timeout via that interface
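
To make the sd_notify(3) idea concrete, here is a minimal sketch of the notification protocol in Python, not flux-core code. It assumes the unit is configured so that systemd accepts notifications from the process (for example via NotifyAccess=) and a systemd new enough to understand EXTEND_TIMEOUT_USEC. The helper names sd_notify and extend_stop_timeout are hypothetical and chosen for illustration only.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one notification datagram to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process is
    not running under a manager that listens for notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket, which is
    # addressed with a leading NUL byte instead.
    addr = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True


def extend_stop_timeout(seconds: float) -> bool:
    """Ask the manager to extend the timeout of the current state.

    The service must send another message within this interval, or the
    manager treats the timeout as expired (hypothetical helper).
    """
    usec = int(seconds * 1_000_000)
    return sd_notify(f"EXTEND_TIMEOUT_USEC={usec}")
```

In a scenario like the one above, a shutdown path that is still flushing the content backlog could call extend_stop_timeout(30) periodically so that the stop-sigterm state is not cut short. The simpler alternative mentioned above is to raise TimeoutStopSec in the unit (or set it to infinity), at the cost of never killing a shutdown that is truly hung.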

Feb 24 '22 23:02 garlick
