flux-core
sdexec: error signaling job tasks causes transient job service unit to exit without removing cgroup
On tuolumne, we're seeing sets of drained nodes with 'unkillable processes' even though there are no processes running when admins investigate after the fact.
In one instance, a job was canceled at 11:57 and its nodes were drained at 12:12, after the job-exec timeout expired. Note that this indicates the job-exec module still thought the sdexec-launched subprocesses were active at that time.
On one of the drained nodes, a log for the transient job service unit was obtained from journalctl (note that this must be run as root, not as the flux user):
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Main process exited, code=exited, status=137/n/a
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/[email protected]/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/[email protected]/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group `, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/[email protected]/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed with result 'exit-code'.
It appears that the unit exited immediately (Failed with result 'exit-code'). Also, the cgroup /user.slice/user-767.slice/[email protected]/imp-shell-168-fD4g2D7SS5D.service remains on the system, even though there are no processes in it.
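For anyone trying to confirm the same state on another node, here is a minimal sketch of the check. It assumes a unified (cgroup v2) hierarchy mounted at /sys/fs/cgroup; the helper name `cgroup_status` is hypothetical and not part of flux-core.

```python
from pathlib import Path

# Assumption: unified cgroup v2 hierarchy mounted at the usual location.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def cgroup_status(cgroup, root=CGROUP_ROOT):
    """Return 'missing', 'empty', or 'populated' for a cgroup path.

    Hypothetical helper for illustration only.
    """
    path = Path(root) / cgroup.lstrip("/")
    if not path.is_dir():
        return "missing"
    # cgroup.events carries "populated 1" while this cgroup or any
    # descendant still contains live processes, and "populated 0" otherwise.
    events = path / "cgroup.events"
    if events.is_file():
        for line in events.read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "populated":
                return "populated" if value.strip() == "1" else "empty"
    # Fall back to listing the PIDs directly attached to this cgroup.
    procs = path / "cgroup.procs"
    pids = procs.read_text().split() if procs.is_file() else []
    return "populated" if pids else "empty"
```

Called with the cgroup path from the log, a result of `empty` would match what is described here: the directory is still present but nothing is running in it.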
This is probably related to #6011.
Also related, perhaps we need to set TimeoutStopSec to infinity so that systemd will wait until all processes in the cgroup exit before considering the unit stopped/exited.
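To experiment with this setting outside of Flux, something like the sketch below could be used. It only builds a `systemd-run` command line that starts a transient user service with `TimeoutStopSec=infinity`; it does not show how sdexec sets unit properties, and the function name and the `timeout-test` unit name are made up for illustration.

```python
import shlex


def transient_unit_cmd(unit, command, stop_timeout="infinity"):
    """Build a systemd-run command line for a transient user service.

    Hypothetical helper: unit is the transient unit name, command is the
    argv list to run, stop_timeout is passed through as TimeoutStopSec.
    """
    return [
        "systemd-run",
        "--user",
        f"--unit={unit}",
        f"--property=TimeoutStopSec={stop_timeout}",
        "--",
        *command,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(transient_unit_cmd("timeout-test", ["sleep", "600"])))
```

Whether this changes the behavior seen in the log above is untested; it is only a way to reproduce the unit configuration by hand.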