flux-core

systemd user instance exits with error, jobs hang in start if using the node

Open grondo opened this issue 1 year ago • 0 comments

This sequence of events was logged in the systemd user instance for user flux on a node:

Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Main process exited, code=killed, status=9/KILL
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed with result 'signal'.
Jul 16 07:48:18  systemd[1]: user@765.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:18  systemd[1]: user@765.service: Killing process 35166 (dbus-daemon) with signal SIGKILL.
Jul 16 07:50:19  systemd[1]: user@765.service: Processes still around after SIGKILL. Ignoring.
Jul 16 07:50:19  systemd[1]: user@765.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:52:19  systemd[1]: user@765.service: Processes still around after final SIGKILL. Entering failed mode.
Jul 16 07:52:19  systemd[1]: user@765.service: Failed with result 'timeout'.
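
For triage, it can help to tally which processes systemd repeatedly tried to SIGKILL. The following is a minimal sketch, not part of flux-core: the function name `count_sigkill_attempts` and the regular expression are illustrative assumptions based only on the message format visible in the log above, and the sample contains three lines copied from that log.

```python
import re
from collections import Counter

# Assumed message format, taken from the journal excerpt in this issue:
#   "... Killing process <pid> (<name>) with signal SIGKILL."
KILL_RE = re.compile(r"Killing process (\d+) \(([^)]+)\) with signal SIGKILL")


def count_sigkill_attempts(journal_text: str) -> Counter:
    """Count SIGKILL attempts per (pid, process name) in journal text."""
    attempts = Counter()
    for line in journal_text.splitlines():
        match = KILL_RE.search(line)
        if match:
            attempts[(int(match.group(1)), match.group(2))] += 1
    return attempts


# Three lines copied from the log above (hypothetical helper input).
sample = """\
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01  systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed with result 'signal'.
"""

for (pid, name), n in count_sigkill_attempts(sample).items():
    print(f"pid={pid} name={name} sigkill_attempts={n}")
```

Run over the full excerpt above, the same pattern matches flux-shell (PID 151743) four times and dbus-daemon (PID 35166) once.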

During this time, a job using this node was started, and the job was stuck before the start event while sdbus tried to connect:

2024-07-16T17:48:51.900739Z sdbus.info[]: unix:path=/run/user/765/bus: No such file or directory (retrying in 60s)
2024-07-16T17:49:51.901256Z sdbus.info[]: unix:path=/run/user/765/bus: No such file or directory (retrying in 60s)
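
The "No such file or directory" message means the user bus socket was absent while the user instance was down. As a standalone diagnostic sketch (this is not how the sdbus module itself connects; the helper names `user_bus_path` and `bus_socket_present` are hypothetical), one could check for the socket at the path shown in the log:

```python
import os
import stat


def user_bus_path(uid: int) -> str:
    """Build the user bus socket path in the form seen in the sdbus log."""
    return f"/run/user/{uid}/bus"


def bus_socket_present(path: str) -> bool:
    """Return True only if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Any stat failure (missing path, permission denied, ...) is
        # treated as "not present" for this diagnostic.
        return False


path = user_bus_path(765)  # UID 765 is the flux user in the logs above
print(path, "present" if bus_socket_present(path) else "missing")
```

A missing socket here would be consistent with the retries logged above, but it does not by itself explain why the user instance failed to stop or restart.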

An attempt to manually start the user instance was made, but this failed:

Jul 16 10:45:22 systemd[1]: Starting User Manager for UID 765...
Jul 16 10:45:22 systemd[1]: user@765.service: Failed with result 'protocol'.
Jul 16 10:45:22 systemd[1]: Failed to start User Manager for UID 765.

A subsequent attempt succeeded though:

Jul 16 10:50:09 systemd[1]: Starting User Manager for UID 765...
Jul 16 10:50:09 systemd[155640]: Starting D-Bus User Message Bus Socket.
Jul 16 10:50:09 systemd[155640]: Reached target Timers.
Jul 16 10:50:09 systemd[155640]: Reached target Paths.
Jul 16 10:50:09 systemd[155640]: Listening on Sound System.
Jul 16 10:50:09 systemd[155640]: Listening on Multimedia System.
Jul 16 10:50:09 systemd[155640]: Listening on D-Bus User Message Bus Socket.
Jul 16 10:50:09 systemd[155640]: Reached target Sockets.
Jul 16 10:50:09 systemd[155640]: Reached target Basic System.
Jul 16 10:50:09 systemd[1]: Started User Manager for UID 765.

grondo, Jul 16 '24 18:07