flux-core
systemd user instance exits with error, jobs using the node hang at start
This sequence of events was logged in the systemd user instance for user flux on a node:
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Main process exited, code=killed, status=9/KILL
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed to kill control group /user.slice/user-765.slice/user@765.service/imp-shell-1686-f27wRf3NeJbH.service, ignoring: Operation not permitted
Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: Failed with result 'signal'.
Jul 16 07:48:18 systemd[1]: user@765.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:48:18 systemd[1]: user@765.service: Killing process 35166 (dbus-daemon) with signal SIGKILL.
Jul 16 07:50:19 systemd[1]: user@765.service: Processes still around after SIGKILL. Ignoring.
Jul 16 07:50:19 systemd[1]: user@765.service: Killing process 151743 (flux-shell) with signal SIGKILL.
Jul 16 07:52:19 systemd[1]: user@765.service: Processes still around after final SIGKILL. Entering failed mode.
Jul 16 07:52:19 systemd[1]: user@765.service: Failed with result 'timeout'.
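The log above shows systemd sending SIGKILL to the same flux-shell process (PID 151743) four times without it going away. When triaging a longer journal, a small script can tally which processes were repeatedly targeted. This is only a sketch: the function name `count_kill_attempts` and the regular expression are illustrative assumptions based on the message format seen in this log, not part of flux-core or systemd.

```python
import re
from collections import Counter

# Matches journal messages of the form seen above:
#   "Killing process 151743 (flux-shell) with signal SIGKILL."
KILL_RE = re.compile(r"Killing process (\d+) \(([^)]+)\) with signal (SIG[A-Z0-9]+)")


def count_kill_attempts(lines):
    """Return a Counter keyed by (pid, command) of kill attempts found in lines."""
    counts = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Three lines copied from the log above; the last one is not a kill message.
sample = [
    "Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: "
    "Killing process 151743 (flux-shell) with signal SIGKILL.",
    "Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: "
    "Killing process 151743 (flux-shell) with signal SIGKILL.",
    "Jul 16 07:48:01 systemd[35140]: imp-shell-1686-f27wRf3NeJbH.service: "
    "Failed with result 'signal'.",
]

attempts = count_kill_attempts(sample)
for (pid, command), n in attempts.items():
    print(f"{command} (pid {pid}): {n} kill attempt(s)")
```

A process that shows up several times here is one that survived the signal, which makes it a natural starting point for further inspection on the node.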
During this time, a job using this node was started and got stuck before its start event while sdbus repeatedly tried to connect:
2024-07-16T17:48:51.900739Z sdbus.info[]: unix:path=/run/user/765/bus: No such file or directory (retrying in 60s)
2024-07-16T17:49:51.901256Z sdbus.info[]: unix:path=/run/user/765/bus: No such file or directory (retrying in 60s)
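The sdbus messages indicate that the user bus socket at `/run/user/765/bus` did not exist while the user instance was down. A quick way to confirm that state on a node is to check whether the path exists and is a socket. The sketch below is an illustration only: the helper names `user_bus_path` and `bus_socket_present` are hypothetical, and this is not how the sdbus module itself performs its connection attempt.

```python
import os
import stat


def user_bus_path(uid):
    """Build the user bus socket path in the form seen in the sdbus log."""
    return f"/run/user/{uid}/bus"


def bus_socket_present(path):
    """Return True only if path exists and is a UNIX socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Covers a missing path (as in the log above) and permission errors.
        return False


if __name__ == "__main__":
    path = user_bus_path(765)  # UID 765 is taken from the log above
    state = "present" if bus_socket_present(path) else "missing"
    print(f"{path}: {state}")
```

Run as the flux user or root on the affected node; a "missing" result matches the `No such file or directory` condition that sdbus reports.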
A manual attempt to start the user instance failed:
Jul 16 10:45:22 systemd[1]: Starting User Manager for UID 765...
Jul 16 10:45:22 systemd[1]: user@765.service: Failed with result 'protocol'.
Jul 16 10:45:22 systemd[1]: Failed to start User Manager for UID 765.
A subsequent attempt succeeded, though:
Jul 16 10:50:09 systemd[1]: Starting User Manager for UID 765...
Jul 16 10:50:09 systemd[155640]: Starting D-Bus User Message Bus Socket.
Jul 16 10:50:09 systemd[155640]: Reached target Timers.
Jul 16 10:50:09 systemd[155640]: Reached target Paths.
Jul 16 10:50:09 systemd[155640]: Listening on Sound System.
Jul 16 10:50:09 systemd[155640]: Listening on Multimedia System.
Jul 16 10:50:09 systemd[155640]: Listening on D-Bus User Message Bus Socket.
Jul 16 10:50:09 systemd[155640]: Reached target Sockets.
Jul 16 10:50:09 systemd[155640]: Reached target Basic System.
Jul 16 10:50:09 systemd[1]: Started User Manager for UID 765.