deploy-rs
System activation script can sometimes get killed in the middle of an activation
I have again encountered a situation where a system didn't fully switch to a new configuration, leaving the system in a broken state.
In particular, with successful deployments I can see:
nixos[1743212]: switching to system configuration /nix/store/xjb20xl356aizawaicbfyhqa2jpkvf50-nixos-system-hostname-22.05.20220415.d3b68c6
nixos[1743212]: finished switching to system configuration /nix/store/xjb20xl356aizawaicbfyhqa2jpkvf50-nixos-system-hostname-22.05.20220415.d3b68c6
In the particular case where things went wrong, however, I don't see any message indicating success:
nixos[1569487]: switching to system configuration /nix/store/kdaj8fap7s7g20rly005kax18n007hgl-nixos-system-hostname-22.05.20220415.d3b68c6
...
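The check described above (a "switching to" line with no matching "finished switching to" line) can be automated against journal output. The following is only a minimal sketch: the function name `unfinished_switches` is made up for illustration, and it assumes the two message formats look exactly like the log lines quoted in this report.

```rust
use std::collections::BTreeSet;

// Message fragments as they appear in the journal excerpts quoted in this report.
const START: &str = "]: switching to system configuration ";
const FINISH: &str = "]: finished switching to system configuration ";

/// Returns the store paths that have a "switching to" line but no
/// "finished switching to" line anywhere in `lines`.
///
/// Hypothetical helper: matching is done by store path only, so a later
/// successful retry of the same configuration hides an earlier failure.
pub fn unfinished_switches<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut started: Vec<&str> = Vec::new();
    let mut finished: BTreeSet<&str> = BTreeSet::new();

    for line in lines {
        if let Some((_, path)) = line.split_once(FINISH) {
            finished.insert(path.trim());
        } else if let Some((_, path)) = line.split_once(START) {
            started.push(path.trim());
        }
    }

    // Keep only activations that never logged a completion message.
    started
        .into_iter()
        .filter(|path| !finished.contains(path))
        .collect()
}
```

Feeding it the lines of the journal for the affected boot should list the configuration from the failed activation and nothing from the successful one.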
There are a couple of things that could have contributed to it:
- This machine was activating a new profile for itself;
- This machine was being accessed over `ssh` over `wireguard` over network. All of these were being restarted as part of the activation:
  systemd[1]: Stopped Dnsmasq Daemon.
  systemd[1]: Stopped Address configuration of vmbr0.
  systemd[1]: Stopped Bridge Interface vmbr0.
  systemd[1]: Stopped WireGuard Tunnel - wg0.
- For this particular activation systemd got reexecuted:
  systemd[1]: Reexecuting.
  systemd[1]: systemd 250.4 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LI>
  systemd[1]: Detected architecture x86-64.
  systemd[1570147]: /nix/store/4sjmk6209x5c6ns3b7193qpq03r7m8wv-systemd-250.4/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
  systemd[1]: polkit.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: sshd@17-::1:22-::1:57492.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: [email protected]:22-192.168.2.3:59828.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: systemd-journald.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: systemd-logind.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: [email protected]: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: dbus.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: [email protected]: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: nix-daemon.service: Current command vanished from the unit file, execution of the command list won't be resumed.
  systemd[1]: Reloading.
- At this point the newly executed systemd went on to deactivate more units, including the SSH connection which was driving the activation:
  systemd[1]: sshd@17-::1:22-::1:57492.service: Deactivated successfully.
  systemd[1]: Stopped SSH Daemon.
  It then started some basic units again, but not everything that had been stopped.
I imagine that systemd does that thing here where it terminates the entire cgroup very thoroughly, and that the activation script actually ends up in the scope of sshd@17-::1:22-::1:57492.service and gets terminated alongside everything else in it.
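One way to test this hypothesis would be to have the activating process report which unit's cgroup it lives in. The following is a rough sketch, not anything deploy-rs does today: `unit_from_cgroup` and `current_unit` are hypothetical names, and the code assumes a unified (cgroup v2) hierarchy, where `/proc/self/cgroup` contains a single `0::<path>` entry.

```rust
use std::fs;

/// Extracts the innermost unit name from the contents of a cgroup v2
/// `/proc/<pid>/cgroup` file, e.g. "0::/system.slice/foo.service".
/// Returns `None` for the root cgroup or if no v2 entry is present.
pub fn unit_from_cgroup(contents: &str) -> Option<&str> {
    contents
        .lines()
        // cgroup v2 entries have the form "0::<path>".
        .find_map(|line| line.strip_prefix("0::"))
        // The last path component is the innermost service, scope, or slice.
        .and_then(|path| path.rsplit('/').next())
        .filter(|unit| !unit.is_empty())
}

/// Reads the cgroup of the calling process and returns its innermost unit.
pub fn current_unit() -> Option<String> {
    let contents = fs::read_to_string("/proc/self/cgroup").ok()?;
    unit_from_cgroup(&contents).map(str::to_owned)
}
```

If `current_unit()` called at the start of an SSH-driven activation returned an `sshd@…` service, that would support the theory that stopping that unit takes the activation script down with it.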
As far as I know, activate-rs today invokes the activation script directly, while setting up a SIGHUP handler to prevent itself from getting killed in some instances. I suspect there are a couple of things that could help make this a little more resilient:
- Start ignoring `SIGTERM` while the activation script is running: systemd will wait for a little while before sending out a `SIGKILL`. This grace time might be sufficient to complete the (re-)activation;
- Invoke the activation script with `systemd-run`: this could potentially avoid issues with the current command vanishing from the unit file, and would also help with things like SIGHUP more holistically.
  - In order to make rollbacks work reliably, it may be necessary to invoke a "guard" unit via `systemd-run` as well, which would monitor the execution of the first `systemd-run` invocation.
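To make the `systemd-run` idea more concrete, here is a rough sketch of how the two invocations might be assembled. Everything in it is an assumption made for illustration: the function names, the unit names, and in particular the guard design (a transient timer that runs a rollback command unless it is stopped first) are not taken from deploy-rs. Only the `systemd-run` options used (`--unit=`, `--collect`, `--wait`, `--on-active=`) are existing flags.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Arguments for running `command` in its own transient service unit, so it
/// no longer lives in the cgroup of the SSH session that started it.
pub fn activation_args(unit: &str, command: &[&str]) -> Vec<String> {
    let mut args = vec![
        format!("--unit={}", unit),
        "--collect".to_string(), // unload the unit afterwards, even if it failed
        "--wait".to_string(),    // block until the unit exits and report its status
        "--".to_string(),
    ];
    args.extend(command.iter().map(|s| s.to_string()));
    args
}

/// Arguments for a hypothetical "guard": a transient timer that runs
/// `rollback` after `timeout_secs` unless `<unit>.timer` is stopped first.
pub fn guard_args(unit: &str, timeout_secs: u64, rollback: &[&str]) -> Vec<String> {
    let mut args = vec![
        format!("--unit={}", unit),
        format!("--on-active={}s", timeout_secs),
        "--".to_string(),
    ];
    args.extend(rollback.iter().map(|s| s.to_string()));
    args
}

/// Runs `systemd-run` with the given arguments and waits for it to return.
pub fn systemd_run(args: &[String]) -> io::Result<ExitStatus> {
    Command::new("systemd-run").args(args).status()
}
```

In this sketch the caller would start the guard first, then the activation, and stop the guard's timer (for example with `systemctl stop <unit>.timer`) once the activation reports success. Whether a timer is the right shape for the guard is an open design question.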