
what happens on failures?

Open tcurdt opened this issue 1 year ago • 13 comments

What happens when something fails after a deploy? Is there any way to auto roll back?

tcurdt avatar Nov 26 '24 15:11 tcurdt

@tcurdt Currently, comin doesn't have anything to detect a failure after a deployment, so it can't do anything on failure.

Do you have something in mind?

Maybe comin could watch a liveness probe in order to be able to roll back. However, I don't know whether it should watch some Prometheus metrics, watch a systemd unit state, or execute a script after the deployment to ensure the machine still works.

(When I'm not sure about a deployment, I currently use a testing branch so that I can reboot the server if it is no longer responding.)

nlewo avatar Nov 26 '24 17:11 nlewo

I guess the main problem is defining what is considered a system that is alive and well. As you already hinted, things that come to mind are:

  • http ping
  • prometheus metric
  • systemd unit state
  • script exit code

All of them would need some kind of grace period after the deploy. If any of them fails, roll back to a working state. But what happens on the next pull? I guess it would need to keep the SHAs that were already tried (somewhere).
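The two ideas in this comment (a grace period for the checks, and remembering which SHAs were already tried) can be sketched as follows. This is only an illustration under stated assumptions, not comin's implementation: `healthy_within`, `FailedCommits`, and the JSON state file are hypothetical names chosen for the example.

```python
import json
import time
from pathlib import Path
from typing import Callable


def healthy_within(probe: Callable[[], bool], grace_seconds: float,
                   interval: float = 1.0) -> bool:
    """Poll `probe` until it passes or the grace period runs out."""
    deadline = time.monotonic() + grace_seconds
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FailedCommits:
    """Persist SHAs that failed their probe so the next pull can skip them.

    The on-disk format (a JSON list) is an assumption for this sketch.
    """

    def __init__(self, path: Path) -> None:
        self.path = path
        self.shas = set(json.loads(path.read_text())) if path.exists() else set()

    def mark_failed(self, sha: str) -> None:
        self.shas.add(sha)
        self.path.write_text(json.dumps(sorted(self.shas)))

    def should_deploy(self, sha: str) -> bool:
        return sha not in self.shas
```

With something like this, a pull that sees a SHA already marked as failed would skip it instead of deploying and rolling back again on every poll.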

tcurdt avatar Nov 26 '24 17:11 tcurdt

Another somewhat related thought:

Is there a way to check the current SHA of the system yet? I see it is mentioned for the prometheus metrics.

tcurdt avatar Dec 02 '24 11:12 tcurdt

Ah, seems like the status page provides it

curl 127.0.0.1:4242/status

tcurdt avatar Dec 02 '24 11:12 tcurdt

I'm currently using deploy-rs to deploy my machines, but I'm also interested in pull-based deployment. The main reason for me is that not all the systems we maintain have permanent internet access.

As most of our systems are remote and not easy to access, it is very important to be sure that a deployment doesn't break anything. deploy-rs has a magic rollback feature that I like. I think it deploys and then checks whether ssh connectivity still works after the deployment. It also rolls back if any new or restarted systemd service fails.

I think that the most important thing is to make sure that a new deployment can be done. Then you're always able to fix your mistakes.

munnik avatar Mar 30 '25 14:03 munnik

The main reason for me is that not all the systems we maintain have permanent internet access.

I think comin is pretty good in this context, since it polls repositories and deploys once it has connectivity.

I think that the most important thing is to make sure that a new deployment can be done. Then you're always able to fix your mistakes.

To be able to roll back, I'm currently using the testing feature: I switch to the testing branch and can reboot the machine to roll back, since the bootloader has not been updated. To reduce the attack surface, I also like that the only way to interact with machines managed by comin are

Regarding an auto-rollback feature: comin runs on the machine itself, so it cannot run connectivity checks by itself; it would need to rely on an external service which could netcat the machine. Maybe something such as https://portchecker.co/ could be used. Maybe comin could run check scripts (defined in Nix by the user/community) after the deployment. If a script fails, comin would roll back to the previously deployed generation.

I agree that an auto-rollback feature would be really nice. But I currently don't feel a strong need for it, and I don't know what it could look like. However, I would like to implement commands to manually roll back or pause comin from the CLI (comin pause and/or comin rollback generation-UUID). Such mechanisms could be a first step towards the auto-rollback feature.

nlewo avatar Apr 04 '25 07:04 nlewo

Regarding an auto-rollback feature: comin runs on the machine itself, so it cannot run connectivity checks by itself; it would need to rely on an external service which could netcat the machine. Maybe something such as https://portchecker.co/ could be used. Maybe comin could run check scripts (defined in Nix by the user/community) after the deployment. If a script fails, comin would roll back to the previously deployed generation.

Just being able to define a script that acts as a liveness probe would go a long way.

  • check processes are running
  • check the database connection
  • check a local url
  • maybe use some other external tool to check from outside

IMO external checks would be great but are somewhat optional.
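A liveness probe script along the lines of this list could look roughly like the sketch below. It is not part of comin: all function names are hypothetical, the process check assumes `pgrep` is installed, and the database check is reduced to "the port accepts a TCP connection".

```python
import socket
import subprocess
import urllib.request
from typing import Callable


def process_running(name: str) -> bool:
    """True if a process with exactly this name exists (assumes pgrep)."""
    try:
        result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    except FileNotFoundError:  # pgrep is not installed
        return False
    return result.returncode == 0


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection can be opened, e.g. to a database port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def url_ok(url: str, timeout: float = 2.0) -> bool:
    """True if a GET on the URL returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def run_probe(checks: dict[str, Callable[[], bool]]) -> int:
    """Return 0 if every check passes, 1 otherwise (usable as exit code)."""
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"probe failed: {name}")
    return 1 if failed else 0


# A real probe script would end with something like the following
# (the process name, port, and URL are made-up examples):
#
#   raise SystemExit(run_probe({
#       "sshd": lambda: process_running("sshd"),
#       "database": lambda: tcp_port_open("127.0.0.1", 5432),
#       "web": lambda: url_ok("http://127.0.0.1:8080/health"),
#   }))
```

Because the result is only an exit code, the same script could later be extended with an external check without changing how it is invoked.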

Within the context of GitOps, I am not sure I see the value of comin pause and/or comin rollback. Could you elaborate on how you imagine them being used? How different would that be from rolling back the NixOS config in git?

tcurdt avatar Apr 04 '25 08:04 tcurdt

Just being able to define a script that acts as a liveness probe would go a long way.

  • check processes are running
  • check the database connection
  • check a local url
  • maybe use some other external tool to check from outside

IMO external checks would be great but are somewhat optional.

I think the idea of having scripts that check whether a rollback is needed is great. In my opinion, the default script should check whether connectivity with (at least one) remote repository is still available. If yes, a new commit can fix all other problems; if not, roll back the system. Users can define other scripts themselves to check whether a deployment was successful. I can imagine that one wants to check that ssh is still running and that you can still log in from a remote system. How these scripts are implemented is up to the user. When all scripts exit with code 0, the deployment is successful; otherwise, roll back.
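The rule proposed here (every script must exit with 0, otherwise roll back) can be sketched in a few lines. This is a hypothetical illustration, not comin code: `checks_pass` and `verify_deployment` are invented names, and the actual rollback is passed in as a callback because the thread does not define how it would be performed.

```python
import subprocess
from typing import Callable, Sequence


def checks_pass(commands: Sequence[Sequence[str]], timeout: float = 30.0) -> bool:
    """Run each check command; succeed only if all of them exit with 0."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # A missing executable or a hanging check counts as a failure.
            return False
        if result.returncode != 0:
            return False
    return True


def verify_deployment(commands: Sequence[Sequence[str]],
                      rollback: Callable[[], None]) -> bool:
    """Return True if the deployment is healthy; otherwise roll back."""
    if checks_pass(commands):
        return True
    rollback()
    return False
```

The default connectivity check suggested above could then be one more entry in `commands`, for example `["git", "ls-remote", "--exit-code", REPO_URL, "HEAD"]`, where `REPO_URL` stands for whichever remote the machine pulls from.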

munnik avatar Apr 04 '25 10:04 munnik

I'm currently reading through the code to see if I can build this feature. I realized that we should also define what happens if a deployment fails: should the system retry? If the repository goes down during deployment, then the check at the end will fail and a rollback happens. Normally, when the connection fails after a deployment, there is something wrong with the commit. Should the system retry after a certain amount of time? Or just give up and mark the commit as broken?

munnik avatar Apr 04 '25 13:04 munnik

I'm currently reading through the code to see if I can build this feature. I realized that we should also define what happens if a deployment fails: should the system retry? If the repository goes down during deployment, then the check at the end will fail and a rollback happens. Normally, when the connection fails after a deployment, there is something wrong with the commit. Should the system retry after a certain amount of time? Or just give up and mark the commit as broken?

I think ideally it should allow for a configurable number of retries. But even without retry it would be an improvement. The most interesting part is probably how to make the person in charge aware.
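A configurable retry count could behave roughly as in the sketch below. It is an assumption about one possible design, not comin's manager: the function name, its parameters, and the choice to roll back after every failed attempt are all made up for illustration.

```python
import time
from typing import Callable


def deploy_with_retries(
    deploy: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
    max_retries: int = 2,
    retry_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a deployment up to 1 + max_retries times.

    Returns True as soon as one attempt verifies. After every failed
    attempt the system is rolled back; once all attempts are used up the
    function returns False so the caller can mark the commit as broken
    and notify whoever is in charge.
    """
    attempts = 1 + max_retries
    for attempt in range(1, attempts + 1):
        deploy()
        if verify():
            return True
        rollback()
        if attempt < attempts:
            sleep(retry_delay)
    return False
```

Setting `max_retries=0` gives the "no retry" behaviour mentioned above: one attempt, one rollback, and the commit is reported as broken.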

tcurdt avatar Apr 04 '25 13:04 tcurdt

Within the context of GitOps, I am not sure I see the value of comin pause and/or comin rollback. Could you elaborate on how you imagine them being used? How different would that be from rolling back the NixOS config in git?

For the pause command: sometimes I would like to keep comin from deploying new commits on my laptop for a short period of time. Currently, I stop the comin system service, but a comin command would be more convenient. On servers, this should not be needed in theory, but in practice I would not be surprised if it were useful for debugging sessions (you don't want the system to be reconfigured while you are debugging).

Regarding the rollback command, this would allow a user to switch back to a previous generation when the new generation has an issue. Instead of having to commit to the repository, a user could just run comin rollback to fix the machine. Once the user has more time to debug, it would then be possible to push new commits to fix the issue. This could also be useful if your machine doesn't have internet access.

Finally, a Prometheus metric could be used to raise an alert when a machine is rolled back, and these metrics could be used to know whether users encounter issues with the latest commit.

I realized that we should also define what happens if a deployment fails: should the system retry?

That would be nice, and this is a long-wanted feature ;) I don't think it would be too hard to implement in the manager.

The most interesting part is probably how to make the person in charge aware.

comin currently exposes Prometheus metrics. However, this may not fit well with comin's pull model, since the comin Prometheus endpoint needs to be exposed on the Internet to be scraped (potential issues with firewalls, NAT, ...).

nlewo avatar Apr 07 '25 21:04 nlewo

Regarding the rollback command, this would allow a user to switch back to a previous generation when the new generation has an issue. Instead of having to commit to the repository, a user could just run comin rollback to fix the machine. Once the user has more time to debug, it would then be possible to push new commits to fix the issue.

And how would you deal with the config drift? Would the deployment be halted on a rollback - similar to "pause"? Would the deploy pipeline need to be resumed after a rollback? And the drift would be reported via prometheus?

comin currently exposes Prometheus metrics. However, this may not fit well with comin's pull model, since the comin Prometheus endpoint needs to be exposed on the Internet to be scraped

Well, this is usually taken care of anyway, be it by an otel/Grafana agent or by having auth on the Prometheus metrics. So I don't really see that as a bad fit.

tcurdt avatar Apr 07 '25 22:04 tcurdt

Wanted to ask if this is still being explored. Having the ability to add a custom liveness probe script would be super helpful for managing unattended GitOps deployments to IoT devices running NixOS.

ProjectInitiative avatar Oct 25 '25 04:10 ProjectInitiative