deploy-rs

system activation script can sometimes get killed in the middle of an activation

Open nagisa opened this issue 3 years ago • 0 comments

I have again encountered a situation where a system didn't fully switch to a new configuration, leaving the system in a broken state.

In particular, with successful deployments I can see

nixos[1743212]: switching to system configuration /nix/store/xjb20xl356aizawaicbfyhqa2jpkvf50-nixos-system-hostname-22.05.20220415.d3b68c6
nixos[1743212]: finished switching to system configuration /nix/store/xjb20xl356aizawaicbfyhqa2jpkvf50-nixos-system-hostname-22.05.20220415.d3b68c6

In the particular case where things went wrong, however, I don't see any message indicating success:

nixos[1569487]: switching to system configuration /nix/store/kdaj8fap7s7g20rly005kax18n007hgl-nixos-system-hostname-22.05.20220415.d3b68c6
...
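The pairing of these two messages can be checked mechanically. Below is a minimal Python sketch, assuming the journal lines look exactly like the ones quoted above; the function name `unfinished_activations` is made up for illustration and is not part of deploy-rs or NixOS.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches the two message shapes quoted above:
#   nixos[PID]: switching to system configuration PATH
#   nixos[PID]: finished switching to system configuration PATH
_ACTIVATION = re.compile(
    r"nixos\[(?P<pid>\d+)\]: (?P<done>finished )?"
    r"switching to system configuration (?P<path>\S+)"
)


def unfinished_activations(lines: Iterable[str]) -> List[str]:
    """Return store paths whose activation started but never logged a finish."""
    started: Dict[Tuple[str, str], str] = {}
    for line in lines:
        match = _ACTIVATION.search(line)
        if match is None:
            continue  # unrelated journal line
        key = (match["pid"], match["path"])
        if match["done"]:
            # A "finished" message closes the activation started by the same PID.
            started.pop(key, None)
        else:
            started[key] = match["path"]
    return list(started.values())
```

Feeding it the three `nixos[...]` lines from this report would flag only the `kdaj8fap...` configuration.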

There are a couple of things that could have contributed to it:

  1. This machine was activating a new profile for itself;

  2. This machine was being accessed over ssh over wireguard over network. All of these were being restarted as part of the activation:

    systemd[1]: Stopped Dnsmasq Daemon.
    systemd[1]: Stopped Address configuration of vmbr0.
    systemd[1]: Stopped Bridge Interface vmbr0.
    systemd[1]: Stopped WireGuard Tunnel - wg0.
    
  3. For this particular activation systemd got reexecuted:

    systemd[1]: Reexecuting.
    systemd[1]: systemd 250.4 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LI>
    systemd[1]: Detected architecture x86-64.
    systemd[1570147]: /nix/store/4sjmk6209x5c6ns3b7193qpq03r7m8wv-systemd-250.4/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
    systemd[1]: polkit.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: sshd@17-::1:22-::1:57492.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: [email protected]:22-192.168.2.3:59828.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: systemd-journald.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: systemd-logind.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: [email protected]: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: dbus.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: [email protected]: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: nix-daemon.service: Current command vanished from the unit file, execution of the command list won't be resumed.
    systemd[1]: Reloading.
    
  4. At this point the newly executed systemd went on to deactivate more units – including the SSH connection which was driving the activation

    systemd[1]: sshd@17-::1:22-::1:57492.service: Deactivated successfully.
    systemd[1]: Stopped SSH Daemon.
    

    before starting some basic units again, but not everything that had been stopped.

I imagine that systemd does that thing here where it terminates the entire cgroup very thoroughly, and that the activation script actually ends up in the cgroup of sshd@17-::1:22-::1:57492.service and gets terminated alongside the rest of that cgroup.
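One way to test this guess would be to check which cgroup the activation process actually lives in. The following is a rough Python sketch, not something deploy-rs does today. It assumes the unified (v2) cgroup hierarchy, where `/proc/<pid>/cgroup` contains a single `0::<path>` line; the helper names `unit_from_cgroup` and `current_unit` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def unit_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the last component of the cgroup v2 path, i.e. the owning unit or scope."""
    for line in cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:path". Split on the first
        # two colons only, because unit names may themselves contain colons.
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return PurePosixPath(path).name or None
    return None


def current_unit() -> Optional[str]:
    """Look up the unit of the calling process (Linux only)."""
    with open("/proc/self/cgroup") as handle:
        return unit_from_cgroup(handle.read())
```

If `current_unit()`, called from the process that runs the activation, reports an `sshd@...service` unit rather than a session scope, that would support the explanation above.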

As far as I know, activate-rs today invokes the activation script directly, while setting up SIGHUP handling to prevent itself from getting killed in some instances. I suspect there are a couple of things that could help make this a little more resilient:

  • Start ignoring SIGTERM while the activation script is running – systemd will wait for a little while before sending out a SIGKILL. This grace time might be sufficient to complete the (re-)activation;
  • Invoke the activation script with systemd-run – this could potentially avoid issues with the current command vanishing from the unit file, and would also help with things like SIGHUP more holistically.
    • In order to make rollbacks work reliably, it may be necessary to invoke a “guard” unit via systemd-run as well which would monitor the execution of the first systemd-run invocation.
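To make the first two suggestions concrete, here is a rough sketch in Python (activate-rs itself is Rust, so this only illustrates the idea). The names `sigterm_ignored`, `systemd_run_argv` and `activate`, the unit name `deploy-activation`, and the script path in the usage note are all hypothetical. `--unit`, `--wait` and `--collect` are existing `systemd-run` options. The guard unit for rollbacks is not covered.

```python
import signal
import subprocess
from contextlib import contextmanager
from typing import Iterator, List, Sequence


@contextmanager
def sigterm_ignored() -> Iterator[None]:
    """Ignore SIGTERM inside the block and restore the previous disposition after."""
    previous = signal.signal(signal.SIGTERM, signal.SIG_IGN)
    try:
        yield
    finally:
        # `previous` is None only if the handler was installed outside Python.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


def systemd_run_argv(unit: str, command: Sequence[str]) -> List[str]:
    """Build a systemd-run command line that runs `command` as a transient service."""
    return [
        "systemd-run",
        f"--unit={unit}",  # fixed name, so the run can be found in the journal
        "--wait",          # block until the transient service has finished
        "--collect",       # unload the unit afterwards even if it failed
        "--",
        *command,
    ]


def activate(command: Sequence[str], unit: str = "deploy-activation") -> int:
    """Run the activation in its own unit while this process ignores SIGTERM."""
    with sigterm_ignored():
        # An ignored signal stays ignored across exec, so the systemd-run
        # client started here also ignores SIGTERM.
        return subprocess.run(systemd_run_argv(unit, command), check=False).returncode
```

A call could look like `activate(["/path/to/profile/bin/switch-to-configuration", "switch"])`, with the path being a placeholder. Since the script then runs in its own transient service rather than in the cgroup of the SSH connection, stopping the sshd unit should only take down the waiting `systemd-run` client, not the activation itself.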

nagisa, Apr 16 '22 13:04