deploy-rs

Rollback sometimes returns a success exit code?

Open wmertens opened this issue 1 year ago • 10 comments

In my scripts I call deploy-rs like this (simplified):

	if deploy -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system" "${@}"; then
		echo "Deploy successful, marking in git origin/@$NAME"
	fi

This lets me track which commit each host is on.

I just noticed a host where activation succeeded but the confirmation timed out. deploy-rs rolled back and then still reported success:

[myhost.system]
user = "root"
ssh_user = "root"
path = "/nix/store/6ndyilpqalk9khk4cs30mr4yxgc26qvl-activatable-nixos-system-myhost-22.11.20230512.9656e85"
hostname = "myhost.mydomain"
ssh_opts = ["-F", "/home/wmertens/nix-infra/.direnv/ssh/ssh_config"]

🚀 ℹ [deploy] [INFO] Building profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Copying profile `system` to node `myhost`
🚀 ℹ [deploy] [INFO] Activating profile `system` for node `myhost`
🚀 ℹ [deploy] [INFO] Creating activation waiter
⭐ ℹ [activate] [INFO] Activating profile
🚀 ℹ [deploy] [INFO] Success activating, attempting to confirm activation
🚀 ℹ [deploy] [INFO] Deployment confirmed.
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
⭐ ℹ [activate] [INFO] Activation succeeded!
⭐ ℹ [activate] [INFO] Magic rollback is enabled, setting up confirmation hook...
⭐ ℹ [activate] [INFO] Waiting for confirmation event...
⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation
⭐ ⚠ [activate] [WARN] De-activating due to error
switching profile from version 39 to 38
⭐ ⚠ [activate] [WARN] Removing generation by ID 39
removing profile version 39
⭐ ℹ [activate] [INFO] Attempting to re-activate the last generation
updating GRUB 2 menu...
stopping the following units: ...
[...]
starting the following units: ...
Deploy successful, marking in git origin/@myhost

As you can see, the script erroneously marked the host as being on the new commit even though it rolled back.

Normally the exit code is correct. Perhaps this only happens when the activation succeeded but was then rolled back?
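
Until the exit code is reliable in this case, one possible stopgap is for the calling script to scan the deploy output for the rollback messages shown in the log above. This is a minimal sketch, assuming those message strings stay stable; `rolled_back` and `checked_deploy` are hypothetical helper names, not part of deploy-rs:

```shell
# Hypothetical helper: succeeds if the log file contains one of the
# rollback messages seen in the output above.
rolled_back() {
	grep -q -e 'De-activating due to error' \
		-e 'Timeout elapsed for confirmation' "$1"
}

# Hypothetical wrapper: runs the given command, captures its output,
# and fails if the command failed or the output shows a rollback.
checked_deploy() {
	local log_file status=0
	log_file="$(mktemp)"
	"$@" >"$log_file" 2>&1 || status=$?
	cat "$log_file"
	if [ "$status" -eq 0 ] && rolled_back "$log_file"; then
		status=1
	fi
	rm -f "$log_file"
	return "$status"
}
```

The `if deploy -s ...; then` line would then become `if checked_deploy deploy -s ...; then`. Note that this sketch buffers the output and prints it only after the command finishes.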

wmertens, May 17 '23 05:05

~This is expected behaviour, see #179 and #181~ Disregard please, see my next comment

⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation

As a side note, I recommend using the `--confirm-timeout` option as a workaround.
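
A minimal sketch of passing a longer confirmation timeout, assuming the value is given in seconds. `deploy_with_timeout` and the `DEPLOY_BIN` variable are hypothetical names introduced here so the binary can be swapped out, and 120 in the usage line is an arbitrary example value:

```shell
# Hypothetical wrapper: first argument is the confirmation timeout,
# the remaining arguments are passed through to deploy unchanged.
deploy_with_timeout() {
	local timeout_secs="$1"
	shift
	"${DEPLOY_BIN:-deploy}" --confirm-timeout "$timeout_secs" "$@"
}
```

For example: `deploy_with_timeout 120 -s --ssh-opts="-F $SSH_CONFIG_FILE" ".#\"$NAME\".system"`.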

rvem, May 17 '23 07:05

Oh, sorry, I misinterpreted your initial problem. A zero exit code isn't expected after a rollback. We'll take a look.

rvem, May 17 '23 07:05

Is there a chance that you're using an old version of deploy-rs in one of the places?

rvem, May 17 '23 07:05

Hmm, possible. I'm using the nixpkgs version (master branch) because Rust takes forever to compile. But the deploys should have the same version of deploy-rs on both sides, right?

wmertens, May 17 '23 13:05

> But the deploys should have the same versions of deploy-rs on both sides, right?

Ugh, this is tricky; see https://github.com/serokell/deploy-rs/pull/207. Before this PR, the activate-rs script always used the version from the flake input.

rvem, May 18 '23 03:05

> I'm using the nixpkgs version (master branch) because rust takes forever to compile

Hmm, the version from the current nixpkgs master seems to be new enough.

rvem, May 18 '23 03:05

I got this behaviour when deploying the same closure after somehow reaching a state where the canary file persisted. Deploying the same closure again makes the wait command terminate immediately, so the deploy continues. The remote side then activates and rolls back, because the deploy has already moved on to the next steps.

Solved by running `sudo rm /tmp/deploy-rs-canary-*` on the target machine.
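
The failure mode can be illustrated with a simplified polling model of a waiter that returns as soon as the canary file exists. This is only a sketch of the idea, not deploy-rs's actual implementation; `wait_for_canary` and its arguments are hypothetical:

```shell
# Simplified model: returns 0 as soon as the canary file exists,
# 1 if it has not appeared after max_polls checks (one per second).
wait_for_canary() {
	local canary="$1" max_polls="$2" i=0
	while [ "$i" -lt "$max_polls" ]; do
		[ -e "$canary" ] && return 0
		sleep 1
		i=$((i + 1))
	done
	return 1
}
```

If a canary file from an earlier run is still present, a waiter like this returns on its first check, before the new activation has done anything, which matches the wait command terminating immediately as described above.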

elikoga, Nov 19 '23 02:11

> Got this behaviour if deploying the same closure, and previously i somehow reached a state where the canary file is persistent.

I've seen this behaviour before as well. Perhaps the canary file for the current profile should be cleaned up before redeployment.
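
As a rough sketch of such a cleanup step, using the `/tmp/deploy-rs-canary-*` pattern from the workaround above: `clean_canaries` is a hypothetical helper, and the directory argument exists only to make it easy to try out. The thread does not say how canary files map to profiles, so this version removes all of them rather than only the one for the current profile:

```shell
# Hypothetical helper: removes leftover canary files from a directory
# (defaults to /tmp). Other files in the directory are left alone.
clean_canaries() {
	local dir="${1:-/tmp}"
	rm -f -- "$dir"/deploy-rs-canary-*
}
```

On a target machine this would have to run before the deploy starts, and presumably not while another deployment is still waiting for confirmation.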

rvem, Nov 20 '23 08:11

I think one of the newer deployment tools with support for something similar to magic rollback used UNIX sockets for confirmation; maybe that would be the better way to go for reliability? I actually considered it initially, but I went with the whole inotify canary file thing, thinking "oh, if it's just files, it's easy to look at and poke at any activation issues". Unfortunately, this design may be the source of all of the activation issues, so we might be better off doing something else.

Honestly, I never thought too much about the right way to do confirmations. I got swept away with the thought of "holy shit I can use profile rollbacks to make it impossible (slightly harder?) to break your server" and threw together one of the first ideas that came to mind, which has since gone through various patches by various people. Maybe re-designing/re-implementing the whole concept would be easier than fixing the issues.

notgne2, Nov 20 '23 09:11

The issue with

⭐ ❌ [activate] [ERROR] Error waiting for confirmation event: Timeout elapsed for confirmation

returning a 0 exit code was fixed by #246.

rvem, Dec 18 '23 11:12