colmena
Magic Rollback
Hello,
I'm just now learning of colmena, and it looks great! I'm opening this issue because I was originally looking for a flakeless alternative to deploy-rs, and while looking at deploy-rs I came across its magic rollback feature.
How would you feel about implementing something like that for colmena?
I think the process would be to do something like:
- upload the closure
- run something akin to: nixos-rebuild test; (sleep 300 && rollback) &
- then open a new ssh connection (without ControlMaster, so with ssh -S none)
- have the new ssh connection kill the sleep 300 && rollback command
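To make the steps above concrete, here is a minimal sketch of the "confirm or roll back" guard. The function names arm_rollback and disarm_rollback are made up for illustration, and the rollback command is passed in as an argument, so nothing here is colmena's or deploy-rs's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching the "sleep 300 && rollback" guard.

# Start a background guard that runs the given rollback command after
# <timeout> seconds unless it is cancelled first. Prints the guard's PID.
# usage: arm_rollback <timeout-seconds> <rollback-command...>
arm_rollback() {
  local timeout=$1
  shift
  # Output is discarded so the guard does not keep the caller's pipe open.
  ( sleep "$timeout" && "$@" ) >/dev/null 2>&1 &
  echo $!
}

# Cancel a guard started by arm_rollback (this is what the second,
# freshly opened ssh connection would run).
# usage: disarm_rollback <pid>
disarm_rollback() {
  local pid=$1
  # Stop the sleep first; a killed sleep exits non-zero, so "&&" never
  # reaches the rollback command. pkill may be missing; that is fine.
  pkill -P "$pid" 2>/dev/null || true
  # Then stop the guard subshell itself in case it is still alive.
  kill "$pid" 2>/dev/null || true
}
```

On the target this could be used as pid=$(arm_rollback 300 nixos-rebuild switch --rollback) right after activation, with disarm_rollback "$pid" run from the new connection; both the 300 second timeout and the rollback command are assumptions taken from the example above.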
I think that by using grub's robustness mechanisms it'd even be possible to have a similar mechanism working for when the system needs to be rebooted, which would be an awesome feature that AFAICT no other deployment mechanism has, but that'd certainly be a lot more work to implement.
That's an interesting idea! I think an easy way to implement this is with a list of scripts (deployment.checkScripts maybe?) that are run on the target machine after the configuration is activated. If any of the scripts fail (e.g., internet connectivity check), the configuration is rolled back.
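As a rough sketch of what running such a list of check scripts could look like on the target: the function name run_checks_or_rollback is hypothetical, and deployment.checkScripts does not exist in colmena as far as this thread goes, so the checks and the rollback command are simply passed in as strings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each check in order; on the first failure,
# run the rollback command and report failure.
# usage: run_checks_or_rollback <rollback-command> <check-command>...
run_checks_or_rollback() {
  local rollback=$1
  shift
  local check
  for check in "$@"; do
    if ! sh -c "$check"; then
      echo "check failed: $check" >&2
      sh -c "$rollback"
      return 1
    fi
  done
  return 0
}

# Example (assumed commands, adjust to taste):
#   run_checks_or_rollback 'nixos-rebuild switch --rollback' \
#     'ping -c 1 -W 5 1.1.1.1' \
#     'systemctl is-active --quiet sshd'
```

Stopping at the first failed check keeps the rollback prompt; later checks are skipped because the configuration is already known to be bad.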
This sounds great! Though it would be even better if there were a way to get “default” check scripts so that everyone doesn't have to redevelop similar check scripts.
Actually that makes me think: it'd also require a deployment.localCheckScripts that would contain scripts run on the deployer host, so that it could try to ssh into the host being deployed, as forward checks are not the same as backward checks if e.g. the firewall is misconfigured in the new configuration.
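A deployer-side check along these lines might be little more than a retry loop around an ssh probe. The sketch below is only an illustration: wait_until_reachable is a made-up name, and the probe is passed in as a command so that it can be swapped for anything else.

```shell
#!/usr/bin/env bash
# Hypothetical helper run on the deployer host: retry a probe command
# until it succeeds or the attempts run out.
# usage: wait_until_reachable <attempts> <delay-seconds> <probe-command...>
wait_until_reachable() {
  local attempts=$1 delay=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example probe (host name is a placeholder): open a fresh connection,
# bypassing any ControlMaster socket, and run a no-op on the target.
#   wait_until_reachable 5 10 \
#     ssh -S none -o BatchMode=yes -o ConnectTimeout=5 root@target-host true
```

If the loop returns non-zero, the deployer knows it cannot get back in and can leave the target's own rollback guard to fire.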
I have to say I regularly have to switch off auto-rollback in deploy-rs because it's too indiscriminate.
So being able to define the checks oneself is a great design.
Between that and a default, there is still the option of a library.
Defaults are comfortable, but opaque.
I've been using colmena to deploy some DigitalOcean machines, and while reworking the network config I have lost connectivity a few times, so having checks is very welcome.
@zhaofengli I noticed a comment mentioning an auto-rollback.sh script, but I can't find its source -- is this part of an implementation of this feature?
Right now there is no such file. It is still a planned feature, but I am working on an alternative way to fit my rollback needs.
Here is a quick idea of how this would work, though it is still under construction. The rollback idea has several implications:
- It assumes the last usable profile (the rollback target) is current - 1
- It is flexible enough to support user-defined checks (who handles the check and how)
- The system should be able to reboot and still auto-rollback (in case a kernel upgrade breaks network connections, e.g. a driver issue)
Here is a sample script:
{ pkgs, ... }: # module header added so that pkgs is in scope
let
  timestamp = "/var/auto-rollback-timestamp";
  check-period = "*:0,30"; # every 30 minutes
  wait-period = "30"; # allow 30 minutes to recover
in {
  deployment.keys."auto-rollback-timestamp" = {
    keyFile = ./blank; # An empty file
    destDir = "/var"; # Default: /run/keys
    permissions = "0755"; # Default: 0600
    uploadAt = "pre-activation";
  };
  system.activationScripts = {
    auto-rollback = {
      text = ''
        # If the auto-rollback-timestamp file exists and is empty, write the current timestamp to it
        if [ -f "${timestamp}" ]; then
          if [ ! -s "${timestamp}" ]; then
            date +%s > "${timestamp}"
          fi
        fi
      '';
    };
  };
  systemd.timers.auto-rollback = {
    wantedBy = ["timers.target"];
    timerConfig = {
      OnCalendar = check-period;
      Unit = "auto-rollback.service";
    };
  };
  systemd.services.auto-rollback = {
    restartIfChanged = false;
    serviceConfig = {
      Type = "oneshot";
      ExecStart = pkgs.writeScript "auto-rollback" ''
        #!${pkgs.runtimeShell} -e
        timestamp="${timestamp}"
        function rollback_to_previous_profile {
          # Remove the marker first so the rollback is attempted only once
          rm -f "$timestamp"
          # List the profiles directory, newest first, keeping the last field of each line
          # (for a symlink, that is its target)
          files=()
          while IFS= read -r file; do
            files+=("$file")
          done < <(ls -lt "/nix/var/nix/profiles" | ${pkgs.gawk}/bin/awk 'NR > 1 {print $NF}')
          # Index 2 is expected to be the store path of the previous generation (current - 1)
          last_profile=''${files[2]}
          echo "Switch back to Profile: $last_profile"
          $last_profile/bin/switch-to-configuration switch
        }
        if [ ! -f "$timestamp" ]; then
          echo "File does not exist: $timestamp, skipping rollback checks"
          exit 0
        fi
        stored_timestamp=$(cat "$timestamp")
        number_regex='^[0-9]+$'
        if [[ $stored_timestamp =~ $number_regex ]]; then
          echo "Valid timestamp: $stored_timestamp"
        else
          echo "Invalid timestamp! Rolling back right now..."
          rollback_to_previous_profile
          # Stop here: the comparison below needs a numeric timestamp
          exit 0
        fi
        wait_period="${toString wait-period}"
        echo "checking timestamp $stored_timestamp"
        current_time=$(${pkgs.coreutils}/bin/date +%s)
        threshold=$((current_time - (wait_period * 60)))
        echo "Threshold $threshold"
        if [ "$stored_timestamp" -lt "$threshold" ]; then
          echo "Current time has a difference of more than $wait_period minutes compared to the stored timestamp."
          rollback_to_previous_profile
        else
          echo "Current time is within $wait_period minutes of the stored timestamp."
          exit 0
        fi
      '';
    };
  };
}
Here is how the above script works:
- pre-activation: colmena uploads an empty file to the given directory
- activation script: if the file exists and is empty, write the timestamp to it
- The systemd service repeatedly checks whether the file exists; if the corresponding timestamp has exceeded the maximum wait time, it deletes the file and rolls back
- To stop the rollback, delete /var/auto-rollback-timestamp
The good thing about this design is that it gives the user a flexible way to control rollback: to stop it, simply delete the file. By default I run colmena exec rm /var/auto-rollback-timestamp, but the user could also run scripts that delete the file.
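The decision the service makes on each timer tick boils down to one comparison, which can be pulled out into a small function for testing. This is only a sketch of that logic: should_rollback is a name I made up, and taking the current time as an argument (instead of calling date) is a choice made purely so the function can be exercised with fixed values.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the timestamp check in the service script.
# Returns 0 (roll back) if the stored timestamp is invalid or older than
# <wait-minutes>; returns 1 (keep waiting) otherwise.
# usage: should_rollback <stored-timestamp> <now-epoch-seconds> <wait-minutes>
should_rollback() {
  local stored=$1 now=$2 wait_minutes=$3
  case $stored in
    # Empty or non-numeric content counts as invalid: roll back right away.
    ''|*[!0-9]*) return 0 ;;
  esac
  local threshold=$((now - wait_minutes * 60))
  [ "$stored" -lt "$threshold" ]
}

# Example:
#   if should_rollback "$(cat /var/auto-rollback-timestamp)" "$(date +%s)" 30; then
#     echo "deadline passed, rolling back"
#   fi
```

Note that with a 30 minute check period and a 30 minute wait period, the rollback can fire anywhere between 30 and 60 minutes after activation, depending on where the deployment lands relative to the timer.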
However, the timer approach may introduce a race condition, and it leaves a timer constantly running on the system to check the file. This could be better addressed by having the systemd service trigger only on each deployment or via OnBootSec.
I am not entirely sure whether this is a good approach at all, how it could become part of colmena, or whether deploy-rs uses a similar strategy (which I should take a look at), but I will leave it here for now in case it helps others who need a solution at the moment.
We are currently using this approach in a production system and it works: it has saved our ass at least twice :) Feel free to provide feedback.