
Deploy Targets: Policy/Behavior-free Deployment Hooks (auto-rollbacks, drain events, etc.)


(don't merge yet)

This PR adds trivial targets, without any particular behavior or policy, to NixOps. These targets allow the system to react to deployment events, and let users customize that reaction to their use case.

In particular, this PR is authored with the intent of providing safe and automatic rollbacks for embedded NixOS systems which are hard to access and repair. Thank you to Yakkertech for sponsoring this work.

The following targets are added, with a note of when they become active:

  • deploy-prepare.target - before a new system is activated.
  • deploy-healthy.target - after NixOps reconnects over SSH.
  • deploy-failed.target - if deploy-healthy.target fails.
  • deploy-complete.target - after every system in the given deployment (i.e. --include / --exclude) completes successfully.

During a deploy, each system has the following steps executed independently:

  1. copy the system closure
  2. start deploy-prepare.target. If this fails, fail this machine's deploy.
  3. run the new system's activation script
  4. close the SSH connection
  5. open a new SSH connection. If this fails, the next step is skipped.
  6. start deploy-healthy.target. If this fails, deploy-failed.target is started automatically, and NixOps sees an error.

Once every server has completed these steps, and if they all completed successfully, NixOps will SSH to each machine and start deploy-complete.target.
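For illustration, a rough sketch of that per-machine sequence as plain shell (the host name and store path are hypothetical placeholders; NixOps performs these steps over its own SSH machinery):

host=web-1
system=/nix/store/...-nixos-system            # hypothetical system closure

nix-copy-closure --to "$host" "$system"                       # 1. copy the closure
ssh "$host" systemctl start deploy-prepare.target || exit 1   # 2. prepare hook; failure fails this machine's deploy
ssh "$host" "$system/bin/switch-to-configuration" switch      # 3. activate (4. the connection closes after)
ssh "$host" systemctl start deploy-healthy.target || exit 1   # 5-6. reconnect and run the health hook
# On the host, a failing deploy-healthy.target starts
# deploy-failed.target automatically (e.g. via OnFailure=).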

Importantly, none of these targets actually do anything on their own. NixOps expresses no preference, policy, or behavior. @adisbladis and I feel the space is too large and complicated for any one implementation to get it right.
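For a sense of scale, a policy-free target is nothing more than a named synchronization point. A minimal sketch of how they might be declared as a NixOS module (the descriptions and exact options here are my assumption, not necessarily what this PR ships):

{ ... }: {
  # Bare synchronization points with no behavior of their own;
  # users attach services via Requires=/Before=/After=.
  systemd.targets.deploy-prepare.description = "Pre-activation deployment hook";
  systemd.targets.deploy-healthy.description = "Deployment-confirmed-healthy hook";
  systemd.targets.deploy-failed.description = "Deployment-failed hook";
  systemd.targets.deploy-complete.description = "Whole-deployment-complete hook";
}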

Current Problems and To-Do

  • During activation, NixOS's switch-to-configuration.pl starts deploy-prepare.target and its dependents, causing preparation steps to run too many times, and at inappropriate times. @adisbladis will be working on a solution to this Monday.
  • The Before, Requires, and other directives are very sensitive. We need to provide very precise and accurate tests and documentation on exactly how a user can and should hook into these targets (see the sketch after this list).
  • If deploy-prepare gets stuck or outright refuses a deployment, NixOps has no way to override it and deploy anyway. This is a problem, because there is no way to recover automatically without manually SSHing in and rolling back to some other system version.
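For reference, the hook pattern used throughout the examples below is: the target Requires the hook service, and the hook service orders itself Before the target, so the target only activates once the hook has succeeded. A minimal sketch (my-hook is a hypothetical service name):

{ ... }: {
  # Pull the hook in whenever deploy-prepare.target is started...
  systemd.targets.deploy-prepare.unitConfig.Requires = [ "my-hook.service" ];
  systemd.services.my-hook = {
    # ...and make the target wait for the hook to finish successfully.
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    script = "echo 'preparing for deployment'";
  };
}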

Future Work

  • Potentially move the targets into NixOS itself, teaching nixos-rebuild about these targets.
  • Publish to some well-known location a set of expressions showing ways these targets can be used to safely implement various use cases.

Use Cases

Draining Phase

Hooking a service into deploy-prepare.target allows the server itself to delay a deployment. This can be used to, for example:

  • remove itself from a load balancer
  • wait for an important measurement to complete
  • wait for in-progress builds to complete
  • wait for the mail queue to empty

Example: a server removing itself from an ELB before the deployment, and adding itself back after the system's deployment is healthy:

{ pkgs, ... }: {
  systemd.targets.deploy-prepare.unitConfig.Requires = [ "leave-elb.service" ];
  systemd.services.leave-elb = {
    # Delay deploy-prepare.target until this instance has left the ELB.
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.awscli pkgs.jq ];

    script = ''
      aws elb deregister-instances-from-load-balancer \
          --load-balancer-name prod-web-traffic-load-balancer \
          --instances my-instance-id | jq .
    '';
  };

  systemd.targets.deploy-healthy.unitConfig.Requires = [ "join-elb.service" ];
  systemd.services.join-elb = {
    # Rejoin the ELB only once NixOps has confirmed the deploy is healthy.
    unitConfig.After = [ "deploy-healthy.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.awscli pkgs.jq ];

    script = ''
      aws elb register-instances-with-load-balancer \
          --load-balancer-name prod-web-traffic-load-balancer \
          --instances my-instance-id | jq .
    '';
  };
}

Critical Window Protection

Hooking a service into deploy-prepare.target also allows the server itself to prevent a deployment. This could be used to protect a system from deployments during a critical time window. For example, a system which absolutely must remain undisturbed during a critical production event could flat out refuse the deployment:

Example: a server which refuses deploys after 12:00:

{ pkgs, ... }: {
  systemd.targets.deploy-prepare.unitConfig.Requires = [
    "no-afternoon-deploys.service"
  ];
  systemd.services.no-afternoon-deploys = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;

    script = ''
      # Fail (and thereby block deploy-prepare.target) from noon onwards.
      hour=$(date +%H)
      if [ "$hour" -ge 12 ]; then
        echo "Don't deploy during the afternoon!"
        exit 1
      fi
    '';
  };
}

Coordinated Distributed Database Maintenance

Many distributed databases will identify a failed machine and begin reallocating its data, assuming the old machine will not come back. A graceful, coordinated shutdown can avoid this.

For example, with Elasticsearch it is important to disable shard allocation during a deployment, and forcing a synced flush will improve recovery time.

Elasticsearch in particular is interesting, because we can use the deploy-complete hook to ensure every machine has finished before re-enabling allocation: something NixOS and NixOps do not easily support right now.

{ config, pkgs, ... }:
let
  esConfig = config.services.elasticsearch;
  esUrl = "http://${esConfig.listenAddress}:${toString esConfig.port}";
in {
  systemd.targets.deploy-prepare.unitConfig.Requires = [
    "elasticsearch-pre-deploy.service"
  ];
  systemd.services.elasticsearch-pre-deploy = {
    unitConfig.Before = [ "deploy-prepare.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.curl ];

    script = ''
      echo "Disabling shard replication during node shutdown..."

      curl -X PUT "${esUrl}/_cluster/settings?pretty" \
        -H 'Content-Type: application/json' -d'
        {
          "persistent": {
            "cluster.routing.allocation.enable": "primaries"
          }
        }
      '

      echo "Forcing a synced flush to speed up recovery"
      curl -X POST "${esUrl}/_flush/synced?pretty"
    '';
  };

  systemd.targets.deploy-complete.unitConfig.Requires = [
    "elasticsearch-post-deploy.service"
  ];
  systemd.services.elasticsearch-post-deploy = {
    unitConfig.After = [ "deploy-complete.target" ];
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    path = [ pkgs.curl ];

    script = ''
      echo "Re-enabling allocation"
      curl -X PUT "${esUrl}/_cluster/settings?pretty" \
        -H 'Content-Type: application/json' -d'
        {
          "persistent": {
            "cluster.routing.allocation.enable": null
          }
        }
      '
    '';
  };
}

Automatic Rollback After a Failed Deployment

With the addition of a timer, we can implement an automatic rollback in very little code. In this example, the automatic rollback is triggered in two cases:

  • deploy-healthy.target is activated, but fails
  • deploy-healthy.target is not activated within 1 minute of deploy-prepare.target, indicating the new system configuration broke the network or system in a way that prevents the deployment host from confirming the new system.

{ pkgs, ... }:
{
  systemd.services.automatic-rollback = {
    description = "Automatic rollback";
    # Started when deploy-failed.target is reached.
    wantedBy = [ "deploy-failed.target" ];
    script = ''
      echo "Rolling back!"
      # Point the system profile back at the previous generation,
      # then activate it.
      nix-env --rollback -p /nix/var/nix/profiles/system
      /nix/var/nix/profiles/system/bin/switch-to-configuration switch
    '';
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = false;
  };

  systemd.timers.automatic-rollback = {
    enable = true;
    # Arm the timer as soon as the deploy begins.
    wantedBy = [ "deploy-prepare.target" ];

    unitConfig = {
      # Reaching deploy-healthy.target stops the timer,
      # cancelling the pending rollback.
      Conflicts = [ "deploy-healthy.target" ];
    };
    timerConfig = {
      # Fire (and roll back) one minute after the timer starts.
      OnActiveSec = "1m";
      RemainAfterElapse = false;
    };
  };
}

note: the examples are just examples and have not been tested

grahamc avatar Mar 06 '20 20:03 grahamc

During a deploy, each system has the following steps executed independently:

I think it probably makes sense to have these steps depend on one another for the sake of not having partial deploys: all deploy-prepare.targets should be required to finish activating before any activation scripts are called, similar to how a two-phase commit protocol works.
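A rough sketch of that two-phase ordering from the deploying machine's side, as plain shell (the hostnames are hypothetical, and this assumes the new system closures were already copied and set as the system profile on each host):

#!/bin/sh
set -e
hosts="web-1 web-2 web-3"

# Phase 1: every host finishes its preparation hooks before any
# host activates, so a refused deploy stops the whole deployment.
for host in $hosts; do
  ssh "$host" systemctl start deploy-prepare.target
done

# Phase 2: only now run the activation scripts.
for host in $hosts; do
  ssh "$host" /nix/var/nix/profiles/system/bin/switch-to-configuration switch
done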

infinisil avatar Mar 09 '20 02:03 infinisil

Potentially move the targets into NixOS itself, teaching nixos-rebuild about these targets.

It seems like for NixOS there are a couple of NixOps-independent stages that could have targets:

  • pre-activation: activated by nixos-rebuild before it begins activating the system
  • post-activation: activated by nixos-rebuild after it finishes activating the system.

I'm not sure if we want to do this, but we could attach some semantics to these that would be useful even outside NixOps. For example, we could have nixos-rebuild do an automatic rollback if post-activation fails. Then a user could add a dependency from post-activation to a particular service, and that would automatically roll back any upgrade where that service is failing afterwards.
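A sketch of how that user-side dependency might look, assuming the proposed post-activation.target existed and nixos-rebuild rolled back on its failure (both are proposals here, not current behavior; my-critical-app is a hypothetical service):

{ ... }: {
  # If my-critical-app cannot (re)start after activation, the
  # hypothetical post-activation.target fails, which under this
  # proposal would make nixos-rebuild roll the upgrade back.
  systemd.targets.post-activation.unitConfig.Requires = [ "my-critical-app.service" ];
  systemd.services.my-critical-app.unitConfig.Before = [ "post-activation.target" ];
}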

If we did this, then deploy-prepare could perhaps be replaced by pre-activation. deploy-healthy is a bit more than post-activation, since it also includes the "external" health check of NixOps being able to reconnect to the system.

michaelpj avatar Mar 09 '20 14:03 michaelpj

@Infinisil, I like that idea.

@michaelpj,

Potentially move the targets into NixOS itself, teaching nixos-rebuild about these targets.

It seems like for NixOS there are a couple of NixOps-independent stages that could have targets:

I think you're right, and these sorts of additional targets are exactly why I hesitate to implement this in NixOS directly. There are so many use cases and details to consider. I like the idea of renaming deploy-prepare.target to pre-activation.target, and creating a post-activation.target in addition to deploy-healthy.target.

I'm not sure if we want to do this, but we could attach some semantics to these that would be useful even outside NixOps.

Since the targets here are policy- and behavior-free, the only value they have is if they have defined semantics from the start.

For example, we could have nixos-rebuild do an automatic rollback if post-activation fails.

This gets a little bit tricky, because tools like NixOps don't use nixos-rebuild -- so we need to be 100% certain the behavior and semantics match across the ecosystem.

Additionally, the one thing I've learned researching this implementation is that applying any behavior by default is risky: there are so many different use cases and policies. In addition, these specific targets are potentially only the beginning. This is also why I don't yet want to introduce this into NixOS itself.

Consider the case of a more careful process around bootloaders; we may very well want to implement behavior like the following (see the sketch below):

  1. copy closure
  2. test-activate, if failed roll back
  3. validate the connection works
  4. write a bootloader entry in a "one-time-boot" mode
  5. on the next boot, if the system fails, reboot again, causing a bootloader-level rollback
  6. on boot, confirm the new system came up healthy and make it the permanent default in the bootloader

This potentially requires cooperation from something like https://www.intel.com/content/www/us/en/support/articles/000007197/server-products/server-boards.html
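A sketch of steps 4-6 of that flow using GRUB's saved-default mechanism (this assumes GRUB_DEFAULT=saved and a menu entry named new-system-entry, both hypothetical here):

# Step 4: boot the new entry exactly once. If the machine crashes
# before confirmation, the next boot falls back to the old default.
grub-reboot "new-system-entry"
reboot

# Steps 5-6: after the reboot, a confirmation service on the new
# system verifies health and then makes the new entry permanent.
grub-set-default "new-system-entry"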

My goal for this PR specifically is to make a small nibble of progress towards a more robust confirmed-safe deployment process.

grahamc avatar Mar 09 '20 15:03 grahamc

Since the targets here are policy- and behavior-free, the only value they have is if they have defined semantics from the start.

I take your point, and I agree that adding behaviour complicates things a lot!

pre-activation and post-activation have the advantage that they have a clear meaning: you run them before and after the activation script, regardless of whether that's invoked by nixos-rebuild or NixOps.

Perhaps automatic rollbacks are too scary a behaviour, but something simpler like triggering a monitoring endpoint might be a more obviously okay use of post-activation.

michaelpj avatar Mar 09 '20 15:03 michaelpj

@grahamc The rollback behavior you described there is pretty much how I implemented it here: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/scripts/switch, and the accompanying Nix code: https://github.com/Infinisil/nixoses/blob/42ae5dd61a7c65c172bb5d27785879fdbd6c3516/modules/deploy.nix

It has been working out pretty well for me.

infinisil avatar Mar 09 '20 17:03 infinisil

edited to delete a bunch of copypasta which showed up for reasons I can't explain

Perhaps automatic rollbacks are too scary a behaviour, but something simpler like triggering a monitoring endpoint might be a more obviously okay use of post-activation.

On the contrary, this is exactly what I'm using it for! But it is, importantly, up to the people maintaining the machines what the right use case is :).

grahamc avatar Mar 09 '20 17:03 grahamc

I went to implement starting deploy-prepare.target on each host first, and it reminded me of why I chose not to do that in the first place. Since the timer may start when deploy-prepare.target is reached, it is important that the very next step is activation. Otherwise we create a pretty nasty race condition on large networks, or on networks with slow links.

I think this hints towards an additional target, deploy-begin.target, and renaming deploy-prepare.target to pre-activation.target.

grahamc avatar Mar 09 '20 17:03 grahamc

A question is when NixOps should start deploy-begin...

I'm inclined to do it right before activation, and after copying the closure.

Moreover, I'm thinking of only doing it if the deployment mode is test or switch, and not in a build-only, copy-only, or dry-activate mode.

grahamc avatar Mar 09 '20 21:03 grahamc

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update/6525/1

nixos-discourse avatar Apr 01 '20 16:04 nixos-discourse

@grahamc @adisbladis may I ask if there's any chance this will be picked up at some point? Maybe I could help out a bit if somebody tells me what's missing / TBD here :)

Ma27 avatar May 18 '21 18:05 Ma27

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/seamless-nixos-rebuild-switch-with-network-restart/29312/3

nixos-discourse avatar Jun 20 '23 14:06 nixos-discourse