Magic Rollback

Open Ekleog opened this issue 2 years ago • 7 comments

Hello,

I'm just now learning of colmena, and it looks great! I'm just opening this issue because I was originally looking for a flakeless alternative to deploy-rs, and having looked at it made me see its magic rollback feature.

How would you feel about implementing something like that for colmena?

I think the process would be to do something like:

  1. upload the closure
  2. run something akin to nixos-rebuild test; (sleep 300 && rollback) &
  3. then open a new ssh connection (without ControlMaster, so with ssh -S none)
  4. have the new ssh connection kill the sleep 300 && rollback command
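
The four steps above could be sketched in shell roughly as follows. This is only an illustration: the function names `arm_rollback` and `confirm_deploy` and the PID-file handling are assumptions, not something colmena or deploy-rs provides.

```shell
# arm_rollback DELAY PIDFILE CMD...
# Start a background timer that runs CMD after DELAY seconds, and record the
# timer's PID so that a later connection can cancel it.
arm_rollback() {
    delay="$1"; pidfile="$2"; shift 2
    ( sleep "$delay" && "$@" ) >/dev/null 2>&1 &
    echo "$!" > "$pidfile"
}

# confirm_deploy PIDFILE
# Cancel the pending rollback by killing the timer recorded in PIDFILE.
# The exit status is that of kill.
confirm_deploy() {
    pid=$(cat "$1") || return 1
    rm -f "$1"
    kill "$pid" 2>/dev/null
}
```

On the target this might be used as `arm_rollback 300 /run/magic-rollback.pid nixos-rebuild switch --rollback` right after activation, followed by `confirm_deploy /run/magic-rollback.pid` from a fresh connection opened with `ssh -S none`. In practice the timer would probably also need to be detached from the SSH session that started it.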

I think that by using grub's robustness mechanisms it'd even be possible to have a similar mechanism working for when the system needs to be rebooted, which would be an awesome feature that AFAICT no other deployment mechanism has, but that'd certainly be a lot more work to implement.

Ekleog, Nov 19 '21 17:11

That's an interesting idea! I think an easy way to implement this is with a list of scripts (deployment.checkScripts maybe?) that are run on the target machine after the configuration is activated. If any of the scripts fail (e.g., internet connectivity check), the configuration is rolled back.
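
A minimal sketch of that idea, assuming the checks are plain commands that exit non-zero on failure. The `run_checks` helper is hypothetical, and `deployment.checkScripts` is only a proposed option name here.

```shell
# run_checks ROLLBACK CHECK...
# Run each CHECK command in order. On the first failure, run ROLLBACK and
# return 1; return 0 if every check passes.
run_checks() {
    rollback="$1"; shift
    for check in "$@"; do
        if ! "$check"; then
            echo "check failed: $check" >&2
            "$rollback"
            return 1
        fi
    done
    return 0
}
```

Stopping at the first failure keeps the rollback prompt; running all checks first and reporting every failure would be an equally reasonable design.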

zhaofengli, Nov 21 '21 07:11

This would sound great! Though maybe even better if there were a way to get “default” check scripts so that everyone doesn't have to redevelop similar check scripts.

Actually, that makes me think: it would also require a deployment.localCheckScripts containing scripts run on the deployer host, so that it could try to ssh into the host being deployed. Forward checks are not the same as backward checks if, e.g., the firewall is misconfigured in the new configuration.
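
As an illustration, a deployer-side check could be as small as the sketch below. The `check_ssh` name and the `SSH` override variable are assumptions made for this example; the ssh options themselves are standard OpenSSH.

```shell
# check_ssh HOST
# Succeed only if a brand-new SSH connection to HOST can run a trivial command.
# -S none avoids reusing an existing ControlMaster socket, so the check goes
# through the new configuration's firewall and sshd instead of the old session.
# SSH is a hypothetical override for the ssh binary, handy for testing.
check_ssh() {
    "${SSH:-ssh}" -S none -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}
```

`BatchMode=yes` makes the check fail instead of prompting for a password, which matters when it runs unattended.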

Ekleog, Nov 29 '21 01:11

I have to say I regularly have to switch off auto-rollback in deploy-rs because it's too indiscriminate.

So being able to define the checks oneself is a great design.

Between that and a default, there is still the option of a library.

Defaults are comfortable, but opaque.

blaggacao, Apr 07 '22 12:04

I've been using colmena to deploy some DigitalOcean machines and, while reworking the network config, have lost connectivity a few times, so having checks would be very welcome.

otavio, Apr 07 '22 12:04

@zhaofengli I noticed a comment mentioning an auto-rollback.sh script, but I can't find its source -- is this part of an implementation of this feature?

asymmetric, Mar 10 '23 10:03

> @zhaofengli I noticed a comment mentioning an auto-rollback.sh script, but I can't find its source -- is this part of an implementation of this feature?

Right now there is no such file. It is still a planned feature, but I am working on an alternative that fits my own need for rollback.

NeverBehave, May 30 '23 21:05

Here is a quick idea of how this could work; it is still under construction. This approach to rollback rests on several assumptions and requirements:

  1. The last usable profile (the rollback target) is assumed to be current - 1.
  2. User-defined checks should be flexible to implement (who performs the check, and how).
  3. The system should be able to reboot and still roll back automatically (in case a kernel upgrade breaks network connectivity, e.g. through a driver issue).

Here is a sample script:

{ pkgs, ... }:
let
  timestamp = "/var/auto-rollback-timestamp";
  check-period = "*:0,30"; # every 30 mins
  wait-period = "30"; # allow 30 minutes to recover
in {
        deployment.keys."auto-rollback-timestamp" = {
                keyFile = ./blank; # An empty file
                destDir = "/var"; # Default: /run/keys
                permissions = "0755"; # Default: 0600
      
                uploadAt = "pre-activation";
        };

        system.activationScripts = {
            auto-rollback = {
                text = ''
                    # If auto-rollback-timestamp file exists and it is empty, write timestamp to file
                    if [ -f "${timestamp}" ]; then
                        if [ ! -s "${timestamp}" ]; then
                            date +%s > ${timestamp}
                        fi
                    fi
                '';
            };
        };

        systemd.timers.auto-rollback = {
            wantedBy = ["timers.target"];
            timerConfig = {
                OnCalendar = check-period;
                Unit = "auto-rollback.service";
            };
        };

        systemd.services.auto-rollback = {
            restartIfChanged = false;
            serviceConfig = {
                Type = "oneshot";
                ExecStart = pkgs.writeScript "auto-rollback" ''
                #!${pkgs.runtimeShell} -e

                timestamp="${timestamp}"

                function rollback_to_previous_profile {
                    rm -f "$timestamp"

                    # List the profile entries newest-first and keep the last field of
                    # each line (for a symlink, that is its target). files[2] is taken
                    # as the previous generation; this relies on the mtime ordering.
                    files=()
                    while IFS= read -r file; do
                        files+=("$file")
                    done < <(ls -lt "/nix/var/nix/profiles" | ${pkgs.gawk}/bin/awk 'NR > 1 {print $NF}')
                    last_profile=''${files[2]}
                    echo "Switch back to Profile: $last_profile"
                    "$last_profile/bin/switch-to-configuration" switch
                }
                
                if [ ! -f "$timestamp" ]; then
                    echo "File does not exist: $timestamp, skipping rollback checks"
                    exit 0
                fi

                stored_timestamp=$(cat "$timestamp")
                number_regex='^[0-9]+$'

                if [[ $stored_timestamp =~ $number_regex ]]; then
                    echo "Valid timestamp: $stored_timestamp"
                else
                    echo "Invalid timestamp! Rolling back right now..."
                    rollback_to_previous_profile
                    # Stop here so the age check below is not run on an invalid value.
                    exit 0
                fi

                wait_period="${toString wait-period}"

                echo "checking timestamp $stored_timestamp"
                current_time=$(${pkgs.coreutils}/bin/date +%s)
                threshold=$((current_time - (wait_period * 60)))

                echo "Threshold $threshold"
                if [ "$stored_timestamp" -lt "$threshold" ]; then
                    echo "Current time has a difference of more than $wait_period minutes compared to the stored timestamp."
                    rollback_to_previous_profile
                else
                    echo "Current time is within $wait_period minutes of the stored timestamp."
                    exit 0
                fi
            '';
            };
        };
}

Here is how the script above works:

  1. pre-activation: colmena uploads an empty file to the given directory.
  2. Activation script: if the file exists and is empty, write the current timestamp to it.
  3. The systemd service repeatedly checks whether the file exists; once the stored timestamp is older than the maximum wait time, it deletes the file and rolls back.
  4. To stop the rollback, delete /var/auto-rollback-timestamp.
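
The decision made by the service in step 3 can be isolated into a small function, which makes the three possible outcomes easier to see. This is a sketch only: `rollback_decision` and its arguments are hypothetical names, and the current time is passed in instead of being read from `date`.

```shell
# rollback_decision FILE NOW WAIT_MINUTES
# Print what the timer service would do for the given state:
#   skip     - FILE is missing (deployment confirmed, or nothing pending)
#   rollback - FILE is empty or invalid, or its timestamp is older than the wait period
#   wait     - FILE holds a timestamp that is still within the wait period
rollback_decision() {
    file="$1"; now="$2"; wait_min="$3"
    if [ ! -f "$file" ]; then echo skip; return 0; fi
    stored=$(cat "$file")
    case "$stored" in
        ''|*[!0-9]*) echo rollback; return 0 ;;
    esac
    if [ "$stored" -lt $(( now - wait_min * 60 )) ]; then
        echo rollback
    else
        echo wait
    fi
}
```

In the real service, NOW would be `$(date +%s)` and a `rollback` result would trigger the switch to the previous profile.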

The good thing about this design is that it gives users a flexible way to handle rollback: to stop it, simply delete the file. By default I run colmena exec rm /var/auto-rollback-timestamp, but users could also run scripts that delete the file.

However, the timer approach may introduce a race condition, and it leaves a timer permanently running on the system just to check the file. This could be better addressed by having the systemd service trigger only on each deployment or via OnBootSec.
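
One way to avoid the permanently running timer might be a transient one-shot timer armed at deployment time. The sketch below relies on `systemd-run --on-active=`, which is a real systemd feature, but the unit name `auto-rollback-once`, the function names, and the `DRY_RUN` switch are made up for illustration. Note that a transient timer does not survive a reboot, so the reboot case would still need something like OnBootSec.

```shell
# arm_oneshot MINUTES CMD...
# Schedule CMD to run once, MINUTES from now, through a transient systemd timer.
# With DRY_RUN set, the command is printed instead of executed.
arm_oneshot() {
    minutes="$1"; shift
    ${DRY_RUN:+echo} systemd-run --unit=auto-rollback-once --on-active="${minutes}m" "$@"
}

# disarm_oneshot
# Cancel the pending rollback by stopping the transient timer.
disarm_oneshot() {
    ${DRY_RUN:+echo} systemctl stop auto-rollback-once.timer
}
```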

I am not entirely sure whether this is a good approach at all, how it could become part of colmena, or whether deploy-rs uses a similar strategy (which I should look into), but I will leave it here for now in case it helps others who need a solution at the moment.


We are currently using this approach on a production system and it works well: it has saved our ass at least twice :) Feel free to provide feedback.

NeverBehave, May 30 '23 21:05