deploy-rs
Rollback sometimes returns a success exit code?
In my scripts I call deploy-rs like this (simplified):
if deploy -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system" "${@}"; then
  echo "Deploy successful, marking in git origin/@$NAME"
fi
This lets me track which commit each host is on.
I just noticed a host where the deploy succeeded but the confirmation timed out; deploy-rs rolled back and then reported success:
[myhost.system]
user = "root"
ssh_user = "root"
path = "/nix/store/6ndyilpqalk9khk4cs30mr4yxgc26qvl-activatable-nixos-system-myhost-22.11.20230512.9656e85"
hostname = "myhost.mydomain"
ssh_opts = ["-F", "/home/wmertens/nix-infra/.direnv/ssh/ssh_config"]
🚀 ℹ [deploy] [INFO] Building profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Copying profile `system` to node `myhost`
🚀 ℹ [deploy] [INFO] Activating profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Creating activation waiter
⭐ ℹ [activate] [INFO] Activating profile
🚀 ℹ [deploy] [INFO] Success activating, attempting to confirm activation
🚀 ℹ [deploy] [INFO] Deployment confirmed.
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
⭐ ℹ [activate] [INFO] Activation succeeded!
⭐ ℹ [activate] [INFO] Magic rollback is enabled, setting up confirmation hook...
⭐ ℹ [activate] [INFO] Waiting for confirmation event...
⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation
⭐ ⚠ [activate] [WARN] De-activating due to error
switching profile from version 39 to 38
⭐ ⚠ [activate] [WARN] Removing generation by ID 39
removing profile version 39
⭐ ℹ [activate] [INFO] Attempting to re-activate the last generation
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
Deploy successful, marking in git origin/@myhost
As you can see, it erroneously marked the host as being on the new commit even though it rolled back.
Normally the exit code is correct; perhaps this only happens when the activation succeeded but was then rolled back?
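Until the exit code can be trusted, a wrapper script could also inspect the output for the rollback warning. The following is only a sketch: `run_checked` is a hypothetical helper name, and the marker string is copied from the `[activate]` log line above, so it would need updating if that message ever changes.

```shell
# Sketch: treat a command as failed if it exits non-zero OR its output
# contains the rollback warning seen in the log above.
# `run_checked` is a hypothetical helper, not part of deploy-rs.
run_checked() {
  local log status
  log="$(mktemp)"
  # Run the given command, mirroring stdout+stderr to the terminal and to a log file.
  "$@" 2>&1 | tee "$log"
  status="${PIPESTATUS[0]}"   # exit code of the command itself, not of tee
  # Marker text assumed from the "[WARN] De-activating due to error" log line.
  if [ "$status" -eq 0 ] && grep -qF "De-activating due to error" "$log"; then
    status=1
  fi
  rm -f "$log"
  return "$status"
}
```

In the script from the first post this would be used as `if run_checked deploy -s ... ; then`, leaving the rest unchanged.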
~This is expected behaviour, see #179 and #181~ Disregard please, see my next comment
⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation
As a side note, I recommend using the --confirm-timeout option as a workaround.
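For illustration, the call from the first post with a longer confirmation timeout might look like the sketch below. The function name `deploy_with_timeout`, the `DEPLOY_BIN` and `CONFIRM_TIMEOUT` variables, and the value 120 are all assumptions made for the example; the value is assumed to be in seconds.

```shell
# Sketch: same invocation as in the original script, plus --confirm-timeout.
# DEPLOY_BIN, CONFIRM_TIMEOUT and the 120 default are hypothetical, chosen
# only for this example; the timeout is assumed to be in seconds.
deploy_with_timeout() {
  local name="$1"
  shift
  "${DEPLOY_BIN:-deploy}" -s \
    --confirm-timeout "${CONFIRM_TIMEOUT:-120}" \
    --ssh-opts="-F $SSH_CONFIG_FILE" \
    ".#\"$name\".system" "$@"
}
```

Called as `deploy_with_timeout "$NAME" "${@}"`, this gives a slow host more time to confirm before magic rollback kicks in.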
Oh, sorry, I misinterpreted your initial problem. A zero exit code isn't expected after a rollback. We'll take a look.
Is there a chance that you're using an old version of deploy-rs in one of the places?
Hmm possible, I'm using the nixpkgs version (master branch) because rust takes forever to compile. But the deploys should have the same versions of deploy-rs on both sides, right?
But the deploys should have the same versions of deploy-rs on both sides, right?
Ugh, this is tricky, see https://github.com/serokell/deploy-rs/pull/207. Before this PR, the activate-rs script always used the version from the flake input.
I'm using the nixpkgs version (master branch) because rust takes forever to compile
Hmm, the version from the current nixpkgs master seems to be new enough.
Got this behaviour if deploying the same closure, and previously I somehow reached a state where the canary file is persistent. Deploying the same closure again first immediately terminates the wait command, so the deploy continues; the remote side then activates and rolls back because the deploy has already moved on to the next steps.
Solved by running sudo rm /tmp/deploy-rs-canary-* on the target machine.
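A slightly more careful version of that cleanup could look like the sketch below. `clean_canaries` is a hypothetical helper name; only the /tmp/deploy-rs-canary-* pattern comes from the comment above. It should only be run when no deploy to that host is in progress, since a canary that is still in use would be removed too.

```shell
# Sketch: remove leftover deploy-rs canary files and report how many were found.
# `clean_canaries` is a hypothetical helper; the directory defaults to /tmp.
clean_canaries() {
  local dir="${1:-/tmp}" f count=0
  for f in "$dir"/deploy-rs-canary-*; do
    [ -e "$f" ] || continue          # glob matched nothing
    rm -f -- "$f" && count=$((count + 1))
  done
  echo "removed $count canary file(s) from $dir"
}
```

On the target machine this would be run as root, for example by defining the function in a root shell and calling `clean_canaries`.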
Got this behaviour if deploying the same closure, and previously I somehow reached a state where the canary file is persistent.
I've seen this behaviour before as well. Perhaps the canary file for the current profile should be cleaned up prior to redeployment.
I think one of the newer deployment tools with support for something similar to magic rollback used UNIX sockets for confirmation, maybe that might be the better way to go for reliability? I considered it initially actually, but I went with the whole inotify canary file thing thinking "oh if it's just files it's easy to look at and poke at any activation issues", unfortunately this design may be the source of all of the activation issues, so maybe we might be better off doing something else.
Honestly I never thought too much about the right way to do confirmations, I got swept away with the thought of "holy shit I can use profile rollbacks to make it impossible (slightly harder?) to break your server" and threw together one of the first thoughts that came to mind, which has since gone through various patches by various people, but maybe re-designing/re-implementing the whole concept might be easier than fixing the issues.
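To make the canary idea concrete, here is a simplified illustration of a confirmation wait. It assumes confirmation is signalled by the canary file being removed, and it polls once per second instead of using inotify, so it is a stand-in for the concept rather than what deploy-rs actually does. `wait_for_confirmation` is a hypothetical name.

```shell
# Illustration only: wait until a canary file disappears or a timeout elapses.
# Assumes "canary removed" means "deployment confirmed"; uses polling, not inotify.
wait_for_confirmation() {
  local canary="$1" timeout="${2:-30}" waited=0
  while [ -e "$canary" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timeout elapsed for confirmation" >&2
      return 1                        # a caller would roll back at this point
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0                            # canary gone: treated as confirmed
}
```

The sketch also shows why a stale file is a problem: anything that only checks whether the canary exists cannot tell a leftover from a previous deploy apart from the one created by the current activation.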
The issue with
⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation
returning a 0 exit code was fixed by #246.