kargo proposal: FailurePolicy / Automated Rollbacks

From https://github.com/akuity/kargo/issues/2968#issuecomment-2489188432

~~The exact conditions that precipitated this proposal were many Stages whose Promotion processes all attempt pushing to the same branch. Unsurprisingly, this can create races between concurrent Promotions. In the time between one Promo checking out the relevant branch and pushing a new commit to it, another Promotion may have pushed its own commit to that branch, thereby creating a conflict that causes the first Promotion's git-push step to fail.~~

~~This is one of many reasons I strongly promote using a dedicated branch per Stage as a sort of storage, but this issue isn't about the wisdom or folly of any particular approach. The scenario above is merely an accessible example of a Promotion failure that could be resolved simply by repeating the steps of the Promotion process again, starting from 0.~~

Edit: The above scenario has been dealt with through other means, but the proposal is generic and wasn't meant to deal only with that scenario. Read on.

With Promotion processes being entirely user-defined, it's not really possible to build any ~~intelligent~~ generic/"magic" recovery logic directly into ~~the git-push step~~ Promotion process execution. It seems, however, that there is a range of simple and generic "FailurePolicies" that could be quite useful.

Some ideas for further discussion:

- Start the Promotion again from step 0 (retry up to some limit)
- Let the Promotion fail, then automatically create a new one just like it (retry up to some limit)
- Do nothing
- Let the Promotion fail, then automatically create a new one to return the Stage to its previous state (retry up to some limit)
- Execute a user-defined recovery/cleanup process
- Other...

Users could select a policy from these options and we can add more options over time.
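
To make the options concrete, here is a minimal sketch of how a selectable policy could drive execution. Everything in it is hypothetical and for illustration only: `FailurePolicy`, `run_promotion`, and `max_attempts` are assumed names, not part of Kargo, and only the "restart from step 0" and "do nothing" options are implemented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class FailurePolicy(Enum):
    """Hypothetical policy names mirroring the list of ideas above."""
    RESTART_FROM_FIRST_STEP = "restart"  # rerun the same Promotion from step 0
    RECREATE_PROMOTION = "recreate"      # not implemented in this sketch
    DO_NOTHING = "none"
    ROLLBACK = "rollback"                # not implemented in this sketch
    CUSTOM_RECOVERY = "custom"           # not implemented in this sketch


# A step is any callable; it signals failure by raising an exception.
Step = Callable[[], None]


@dataclass
class Result:
    succeeded: bool
    attempts: int


def run_promotion(steps: List[Step], policy: FailurePolicy,
                  max_attempts: int = 3) -> Result:
    """Run all steps in order; on failure, restart from step 0 if allowed."""
    allowed = max_attempts if policy is FailurePolicy.RESTART_FROM_FIRST_STEP else 1
    for attempt in range(1, allowed + 1):
        try:
            for step in steps:
                step()
            return Result(succeeded=True, attempts=attempt)
        except Exception:
            continue  # give up on this attempt; the next one starts at step 0
    return Result(succeeded=False, attempts=allowed)
```

With `DO_NOTHING`, the first failure is final. With `RESTART_FROM_FIRST_STEP`, a transient failure such as a rejected push gets up to `max_attempts` tries.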

Another complementary idea is for individual steps to be able to provide a hint in a failure result as to how best to proceed.
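
One possible shape for such a hint, again purely as an assumption-laden sketch: `Hint`, `StepFailure`, and `should_retry` are invented names, and the design choice here is that a hint can only narrow what the selected policy permits, never widen it.

```python
from enum import Enum
from typing import Optional


class Hint(Enum):
    """Hypothetical hints a step could attach to its failure result."""
    RETRYABLE = "retryable"  # e.g. a rejected push that may succeed next time
    TERMINAL = "terminal"    # e.g. invalid configuration; retrying is pointless


class StepFailure(Exception):
    """Failure raised by a step, optionally carrying a hint."""

    def __init__(self, message: str, hint: Optional[Hint] = None):
        super().__init__(message)
        self.hint = hint


def should_retry(failure: StepFailure, policy_allows_retry: bool,
                 attempt: int, max_attempts: int) -> bool:
    """Retry only if the policy allows it, the step did not veto it,
    and the attempt limit has not been reached."""
    if attempt >= max_attempts:
        return False
    if failure.hint is Hint.TERMINAL:
        return False
    return policy_allows_retry
```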

We've heard many ask for automatic rollbacks before, though we have no issue for it. I would propose that this notion of FailurePolicies might be the correct angle from which to approach that.

cc @jessesuen and @hiddeco for input.

Nov 20 '24 18:11 krancour

> Let the Promotion fail then automatically create a new to return the Stage to its previous state

This is exactly what I'm looking for in our use case. Ideally, if an AnalysisRun fails, it would kick off the failure policy, which could point to its own Promotion with the previous artifacts passed through. There are a lot of ways to do it, but handling failures would be nice.

Jan 27 '25 18:01 michaelasper

Because, as you mention, Promotions are entirely user-defined, I do wonder how successful we can be in automagically performing rollbacks.

One particular detail I am curious about is the fact that Promotions are created from a template at a point in time. I.e., if you promote Freight x, change your Promotion template, and then promote Freight y, this may not result in the expected outcome depending on the change you made to the template.

Given this, I do wonder if an idea which you did not mention should be included on the list:

Define an "on failure" series of steps within the Promotion template itself, which knows what to do on a failure. Because these steps would be self-aware of their own Promotion context, they could for example revert a Git commit they made.
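
As a rough illustration of this idea (not an actual Kargo feature or schema), the sketch below runs a hypothetical list of `on_failure` steps against the same context the regular steps wrote to, so a cleanup step can see, for example, which commit was made. `run_with_on_failure` and the context keys are assumed names.

```python
from typing import Callable, Dict, List

# Shared state that steps read and write, e.g. the commit a step pushed.
Context = Dict[str, str]
Step = Callable[[Context], None]


def run_with_on_failure(steps: List[Step], on_failure: List[Step]) -> bool:
    """Run steps in order; if one raises, run the on-failure steps
    with the same context and report the Promotion as failed."""
    ctx: Context = {}
    try:
        for step in steps:
            step(ctx)
        return True
    except Exception as err:
        ctx["error"] = str(err)
        # An exception raised by an on-failure step propagates to the caller.
        for step in on_failure:
            step(ctx)
        return False
```

Because the on-failure steps receive the context of the Promotion that failed, they are not affected by later changes to the Promotion template.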

Jan 29 '25 21:01 hiddeco

That seems like it fits into #3228

Feb 03 '25 14:02 krancour

Making a note that this proposal needs to expand its scope to include verification failures, not just promotion failures, because a common use case is to go back to the previous version if tests fail (not just promo steps).

Mar 19 '25 21:03 jessesuen

Want #3639 done first.

Apr 07 '25 21:04 krancour

Hey @krancour, I saw this was removed from V1.6.0. Do you mind sharing the timeline for supporting this feature?

Jun 18 '25 17:06 hli5-atl

This issue has been automatically marked as stale because it had no activity for 90 days. It will be closed if no activity occurs in the next 30 days but can be reopened if it becomes relevant again.

Nov 04 '25 11:11 github-actions[bot]

Given that #3639 is now done, what are we looking at for this feature? We're having to roll our own rollback system for this at the moment which feels very strange, like either we're doing something wrong or Kargo is missing a very important feature.

How are people using Kargo with rolling back on failure conditions today? Is it just going all in with e.g. Argo Rollouts, so the manifests remain the same but the underlying pods are rolled back, or is there something else people are doing to achieve this?

Nov 04 '25 11:11 AmyJeanes

> How are people using Kargo with rolling back on failure conditions today?

We have an external system that monitors the status of Stages and, if it detects a problem, rolls all Stages back (via the Kargo API) to the previous known working version.
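
The core of such a monitor can be sketched roughly as follows. This is an assumption about the general shape, not the system described above: `RollbackMonitor` and `observe` are invented names, and `promote` is a stand-in callback for whatever Kargo API call actually triggers a Promotion.

```python
from typing import Callable, Dict, Optional


class RollbackMonitor:
    """Remembers the last healthy version per Stage and re-promotes it
    when a newer version is reported unhealthy."""

    def __init__(self, promote: Callable[[str, str], None]):
        # promote(stage, freight) is a placeholder for the real API call.
        self._promote = promote
        self._last_good: Dict[str, str] = {}

    def observe(self, stage: str, freight: str, healthy: bool) -> Optional[str]:
        """Record one status observation; return the version rolled back to,
        or None if no rollback was triggered."""
        if healthy:
            self._last_good[stage] = freight
            return None
        previous = self._last_good.get(stage)
        if previous is None or previous == freight:
            return None  # nothing known to be better than the current version
        self._promote(stage, previous)
        return previous
```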

Nov 04 '25 11:11 mihuross