flux-core
flux.service: State 'stop-sigterm' timed out. Killing.
Problem: running systemctl stop flux on rank 0 while the throughput test was running on 5 compute nodes timed out, and systemd killed the rank 0 broker with SIGKILL.
There was a content store back end backlog of 10-20K blobs last time I checked, with content.sqlite on an NFS file system.
journalctl -u flux shows:
Feb 24 10:34:11 picl0 systemd[1]: flux.service: State 'stop-sigterm' timed out. Killing.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1640 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 46988 (rc3) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 47002 (flux) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1642 (ZMQbg/IO/0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1643 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:11 picl0 systemd[1]: flux.service: Killing process 1653 (flux-broker-0) with signal SIGKILL.
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Main process exited, code=killed, status=9/KILL
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Failed with result 'timeout'.
Feb 24 10:34:12 picl0 systemd[1]: Stopped Flux message broker.
Feb 24 10:34:12 picl0 systemd[1]: flux.service: Consumed 6h 40min 49.839s CPU time.
Flux was apparently not restarted automatically.
On restart safe mode was entered.
Couple of good comments from @grondo on slack, captured before they scroll off into oblivion:
- There is a TimeoutStopSec directive in systemd.service
  - "infinity" is a valid value
- I wonder if we should look into adding support for sd_notify(3) at some point
  - a service can request an extended timeout via that interface
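
To make the sd_notify(3) idea concrete, here is a rough sketch of how a service could ask systemd for more stop time while it flushes a backlog. This is not flux-core code. The function names notify_extend_timeout() and format_extend_timeout() are hypothetical, and the sketch speaks the notification protocol directly (a datagram sent to the socket named by NOTIFY_SOCKET) rather than linking libsystemd, where the equivalent would be a call to sd_notify() with an EXTEND_TIMEOUT_USEC=... message. It assumes the unit lets the service manager accept notifications from the sending process (for example via NotifyAccess=).

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: build the "EXTEND_TIMEOUT_USEC=<usec>" payload.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_extend_timeout(char *buf, size_t len, unsigned long long usec)
{
    int n = snprintf(buf, len, "EXTEND_TIMEOUT_USEC=%llu", usec);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: ask the service manager to extend the current
 * (start/run/stop) timeout by 'usec' microseconds.
 * Returns 1 if a message was sent, 0 if NOTIFY_SOCKET is unset
 * (not running under systemd), -1 on error with errno set. */
int notify_extend_timeout(unsigned long long usec)
{
    const char *path = getenv("NOTIFY_SOCKET");
    struct sockaddr_un addr;
    socklen_t addrlen;
    char msg[64];
    size_t pathlen;
    ssize_t rc;
    int fd;

    if (!path || path[0] == '\0')
        return 0;
    pathlen = strlen(path);
    /* Only absolute paths and abstract ('@') sockets are handled here. */
    if ((path[0] != '/' && path[0] != '@')
        || pathlen >= sizeof(addr.sun_path)) {
        errno = EINVAL;
        return -1;
    }
    if (format_extend_timeout(msg, sizeof(msg), usec) < 0) {
        errno = EINVAL;
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, pathlen);
    addrlen = offsetof(struct sockaddr_un, sun_path) + pathlen;
    if (path[0] == '@')
        addr.sun_path[0] = '\0';  /* abstract namespace: leading NUL byte */
    else
        addrlen += 1;             /* pathname socket: include trailing NUL */

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    rc = sendto(fd, msg, strlen(msg), 0, (struct sockaddr *)&addr, addrlen);
    close(fd);
    return rc < 0 ? -1 : 1;
}
```

Per sd_notify(3), the value is a window in microseconds within which the service must send another message, so a shutdown path would presumably call something like notify_extend_timeout(30000000ULL) periodically while the content backlog drains, and stop calling it once the flush completes or makes no progress.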