deploy-rs Rollback gives success errorcode output sometimes?

In my scripts I call deploy-rs like this (simplified):

	if deploy -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system" "$@"; then
		echo "Deploy successful, marking in git origin/@$NAME"
		# (git marking step omitted in this simplified version)
	fi

This lets me track which commit each host is on.

I just noticed a host where the deployment succeeded but the confirmation timed out. It rolled back, and deploy still reported success:

[myhost.system]
user = "root"
ssh_user = "root"
path = "/nix/store/6ndyilpqalk9khk4cs30mr4yxgc26qvl-activatable-nixos-system-myhost-22.11.20230512.9656e85"
hostname = "myhost.mydomain"
ssh_opts = ["-F", "/home/wmertens/nix-infra/.direnv/ssh/ssh_config"]

🚀 ℹ [deploy] [INFO] Building profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Copying profile `system` to node `myhost`
🚀 ℹ [deploy] [INFO] Activating profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Creating activation waiter
⭐ ℹ [activate] [INFO] Activating profile
🚀 ℹ [deploy] [INFO] Success activating, attempting to confirm activation
🚀 ℹ [deploy] [INFO] Deployment confirmed.
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
⭐ ℹ [activate] [INFO] Activation succeeded!
⭐ ℹ [activate] [INFO] Magic rollback is enabled, setting up confirmation hook...
⭐ ℹ [activate] [INFO] Waiting for confirmation event...
⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation
⭐ ⚠ [activate] [WARN] De-activating due to error
switching profile from version 39 to 38
⭐ ⚠ [activate] [WARN] Removing generation by ID 39
removing profile version 39
⭐ ℹ [activate] [INFO] Attempting to re-activate the last generation
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
Deploy successful, marking in git origin/@myhost

So as you can see, it erroneously marked the host as being on the new commit even though it rolled back.

Normally the exit code is correct. Perhaps this only happens when the activation succeeded but was then rolled back?
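Until the exit code can be trusted, one possible guard is to scan the deploy output for the rollback messages that appear in the log above. This is only a sketch: `run_and_check` is a hypothetical helper name, and the two marker strings are copied from this log, so they may differ between deploy-rs versions.

```shell
# Hypothetical helper: run a command, echo its output, and return non-zero
# if the command failed OR its output contains a rollback marker.
# Note: output is buffered in a temp file and printed once the command ends.
run_and_check() {
	_log=$(mktemp) || return 1
	"$@" >"$_log" 2>&1
	_status=$?
	cat "$_log"
	# Marker strings taken from the log in this thread (an assumption
	# that they are stable across versions).
	if [ "$_status" -eq 0 ] &&
		grep -qE 'De-activating due to error|Timeout elapsed for confirmation' "$_log"; then
		_status=1
	fi
	rm -f "$_log"
	return "$_status"
}

# Intended use in the script above (not executed here):
#   if run_and_check deploy -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system" "$@"; then
#       echo "Deploy successful, marking in git origin/@$NAME"
#   fi
```

The trade-off is that the output is no longer streamed live, which may matter for long activations.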

May 17 '23 05:05 wmertens

~This is expected behaviour, see #179 and #181~ Disregard please, see my next comment

> ⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation

As a side note, I recommend using the --confirm-timeout option as a workaround.
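For illustration, a minimal sketch of passing a longer timeout from a wrapper script. `deploy_with_timeout` and `CONFIRM_TIMEOUT` are hypothetical names, and the value 120 is an arbitrary example; as far as I know the option takes seconds, but please check `deploy --help` for your version.

```shell
# Hypothetical wrapper: forward all arguments to deploy with a longer
# confirmation timeout. CONFIRM_TIMEOUT is an assumed variable name.
deploy_with_timeout() {
	deploy --confirm-timeout "${CONFIRM_TIMEOUT:-120}" "$@"
}

# Intended use (not executed here):
#   deploy_with_timeout -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system"
```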

May 17 '23 07:05 rvem

Oh, sorry, I misinterpreted your initial problem. A zero exit code isn't expected after a rollback. We'll take a look.

May 17 '23 07:05 rvem

Is there a chance that you're using an old version of deploy-rs in one of the places?

May 17 '23 07:05 rvem

Hmm possible, I'm using the nixpkgs version (master branch) because rust takes forever to compile. But the deploys should have the same versions of deploy-rs on both sides, right?

May 17 '23 13:05 wmertens

> But the deploys should have the same versions of deploy-rs on both sides, right?

Ugh, this is tricky, see https://github.com/serokell/deploy-rs/pull/207. Before this PR, the activate-rs script always used the version from the flake input.

May 18 '23 03:05 rvem

> I'm using the nixpkgs version (master branch) because rust takes forever to compile

Hmm, the version from the current nixpkgs master seems to be new enough.

May 18 '23 03:05 rvem

Got this behaviour if deploying the same closure, and previously I somehow reached a state where the canary file is persistent. Deploying the same closure again first immediately terminates the wait command, so the deploy continues. The remote side then activates and rolls back, because the deploy has already moved on to the next steps.

Solved by running sudo rm /tmp/deploy-rs-canary-* on the target machine.
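A small sketch of that cleanup as a reusable function, in case anyone wants to script it. `clean_canaries` is a hypothetical name, the path pattern is the one from the command above, and I'm assuming no deploy is in progress on the target when it runs.

```shell
# Hypothetical helper: remove leftover deploy-rs canary files from a
# directory (defaults to /tmp, as in the command above).
clean_canaries() {
	_dir=${1:-/tmp}
	# -f keeps rm quiet and successful when no canary file exists.
	rm -f "$_dir"/deploy-rs-canary-*
}
```

On the target machine this would be called as `clean_canaries` (with sudo or as root if the files belong to another user).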

Nov 19 '23 02:11 elikoga

> Got this behaviour if deploying the same closure, and previously i somehow reached a state where the canary file is persistent.

I have seen this behaviour before as well. Perhaps the canary file for the current profile should be cleaned up before redeployment.

Nov 20 '23 08:11 rvem

I think one of the newer deployment tools with support for something similar to magic rollback used UNIX sockets for confirmation; maybe that would be the better way to go for reliability? I actually considered it initially, but I went with the whole inotify canary file thing, thinking "oh, if it's just files, it's easy to look at and poke at any activation issues". Unfortunately this design may be the source of all of the activation issues, so we might be better off doing something else.

Honestly, I never thought too much about the right way to do confirmations. I got swept away with the thought of "holy shit I can use profile rollbacks to make it impossible (slightly harder?) to break your server" and threw together one of the first ideas that came to mind, which has since gone through various patches by various people. Re-designing/re-implementing the whole concept might be easier than fixing the issues.

Nov 20 '23 09:11 notgne2

The issue with

> ⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation

returning a 0 exit code was fixed by #246.

Dec 18 '23 11:12 rvem
