
Feature: Stage-level auto-promotion "pausing"

Open jcantwell-JC opened this issue 9 months ago • 6 comments

Proposed Feature

We propose 2 features:

  1. Pausing auto-promotion of individual Stages from the UI, with fewer permissions required than turning on autoPromotionEnabled at the project level.
  2. A per-Stage option to pause auto-promotion on failure.

Motivation

Our use case is that we have many engineers who are creating new versions of an application, and in many cases they aren't all 100% aware of what the other teams are doing. We'd like a CI pipeline that can safely auto-deploy their code from PR merge all the way through to production.

One edge case we're worried about is when a failure or incident occurs. When this happens, we would like to stop auto-promotions in case engineers who are unaware of the incident are still merging PRs. In particular, we don't want newer freights to just ignore the promotion/verification failure and deploy right over it, as a new release could have very negative consequences for the ongoing incident.

We are aware that we could probably write tooling to toggle autoPromotionEnabled at the project level, but this would require our engineers to now manage deploys in multiple places, and can create some confusion on how to resume, etc.

Suggested Implementation

Based on my review of the code, I think the following changes could be made:

  • Add an autoPromotionPaused state to Stages.
  • Add UI to toggle this setting if autoPromotionEnabled is set for that Stage.
  • When fetching the next Freight for a Stage, consider both autoPromotionEnabled and the new autoPromotionPaused state.
  • Finally, offer a pauseAutoPromotionOnFailure option; when a promotion or verification failure is detected, simply set autoPromotionPaused: true.
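
To make the proposal concrete, here is a minimal Go sketch of the logic described in the list above. It is not taken from Kargo's code: the StageState type, its methods, and the way autoPromotionEnabled is represented are all assumptions made for illustration. Only the three setting names come from this issue.

```go
package main

import "fmt"

// StageState is a hypothetical, simplified stand-in for the settings
// discussed in this issue. It is NOT Kargo's actual Stage type.
type StageState struct {
	AutoPromotionEnabled        bool // existing setting, controlled at the project level
	AutoPromotionPaused         bool // proposed temporary per-Stage state
	PauseAutoPromotionOnFailure bool // proposed per-Stage option
}

// ShouldAutoPromote returns true only when auto-promotion is enabled
// and has not been paused for this Stage.
func (s *StageState) ShouldAutoPromote() bool {
	return s.AutoPromotionEnabled && !s.AutoPromotionPaused
}

// RecordFailure models a promotion or verification failure. If the
// Stage opted in, the failure flips the paused state to true.
func (s *StageState) RecordFailure() {
	if s.PauseAutoPromotionOnFailure {
		s.AutoPromotionPaused = true
	}
}

func main() {
	s := StageState{AutoPromotionEnabled: true, PauseAutoPromotionOnFailure: true}
	fmt.Println(s.ShouldAutoPromote()) // true

	s.RecordFailure()
	fmt.Println(s.ShouldAutoPromote()) // false

	// Resuming is an explicit action; it never enables auto-promotion
	// for a Stage where it was not already enabled.
	s.AutoPromotionPaused = false
	fmt.Println(s.ShouldAutoPromote()) // true
}
```

Note that in this sketch pausing can only restrict auto-promotion. Clearing the pause on a Stage where auto-promotion is not enabled has no effect.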

jcantwell-JC avatar Mar 21 '25 15:03 jcantwell-JC

There are security reasons that this wasn't configurable at the Stage level to begin with, but we know it's currently tedious to toggle this on/off for specific Stages. Project settings are likely getting a major overhaul somewhat soon and I believe when that's done, this will be far less onerous without further modifications.

See #3669, #3670, and #3685

krancour avatar Mar 24 '25 22:03 krancour

It makes sense that auto-promotion is enabled at the project level for security reasons. We were thinking that stage-level pause/resume was a good operational compromise, as it would not allow auto-promotion to be enabled, but would allow it to be turned off temporarily if enabled at the project level. That seems like a good balance of project-level security and operational-level flexibility/safety.

Additionally, even if/when project settings are overhauled, having a concept of Enabled/Disabled be separate from Paused makes sense as a way to differentiate between a long term setting and a temporary state. We would see value in such functionality and would gladly implement it.

jcantwell-JC avatar Mar 25 '25 12:03 jcantwell-JC

I was thinking of opening a similar issue, but I saw this exists so I'll just post a comment here. I actually was thinking of having not a per stage but a per project pipeline stop button. Think of it as a big red "press in case of emergency" button, which anyone can trigger. The use case was similar: either an incident is ongoing or, for example, something known to be broken that happens to pass tests starts to propagate and we want to make sure it stops.

nikolay-te avatar Mar 25 '25 12:03 nikolay-te

@nikolay-te #1422 is close to what you're looking for. It says "system-level," but project-level is probably more realistic.

"per project pipeline stop button"

That's quite a bit harder because a "pipeline" isn't a real thing in Kargo. There are no Pipeline resources to which Warehouses and Stages belong. Pipelines are solely a human construct that our minds, warped by years of CI/CD, can't help overlaying on (possibly disconnected) segments of what is really a DAG.

So there's no place to flip such a toggle.

@jcantwell-JC you make a good point. It could make sense to "pause" at the Stage level when auto-promotion is otherwise enabled, but not pausing at the Stage level doesn't need to be tantamount to enabling. The logic could be: auto-promote if enabled and not paused.

I would still like to see the issue(s) I linked to addressed before introducing something like this. What I want to avoid is, API-wise, taking a step in a direction that ends up either being inconsistent with how we address those other issues or, worse, forces us to make compromises, implement breaking changes, or end up stuck supporting deprecated/legacy configuration options that have been replaced with something more comprehensive.

krancour avatar Mar 25 '25 14:03 krancour

Thanks @krancour. Wanting a solid API makes sense. We will probably experiment with this internally and monitor these issues for a good time to re-raise this feature request, possibly with a PR.

jcantwell-JC avatar Mar 26 '25 01:03 jcantwell-JC

@krancour As I've dug into the code more, there are two processes at play here during reconcile in a Stage:

  1. Auto-promoting Freight: creating a new Promotion for each piece of Freight as it becomes available. This does not mean the Freight will be promoted immediately. The Promotion gets added to the list.
  2. Syncing Promotions: looking at the list of Promotions and picking the next one to start (when the time is right).

Interestingly, I think what I really want with this feature is to pause promotions, not just auto-promotions. This ensures that even things that are queued up do not proceed either. You may have interpreted my original ask this way, even though I didn't understand it yet.

The interesting part here is that while technically the list of Promotions can contain Freights that were manually added, it is most likely a list to accommodate auto-promotions.

Anyway, my tl;dr here is to ask whether pausing all promotions on a Stage is of interest and, if so, whether that should interact with the project-level settings. If not, would we consider adding some type of PausePromotions API to a Stage that is technically independent of auto-promotion?
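
As a rough illustration of the question, the Go sketch below models the two processes and a pause that is independent of auto-promotion. Everything in it is hypothetical: the Stage struct, the PromotionsPaused field, and the method names are invented for this comment and do not reflect Kargo's actual reconciler or API. It also assumes one possible design, where a pause stops queued Promotions from starting but does not stop new ones from being queued.

```go
package main

import "fmt"

// Stage is a hypothetical, simplified model; it is NOT Kargo's Stage resource.
type Stage struct {
	AutoPromotionEnabled bool
	PromotionsPaused     bool     // hypothetical flag, independent of auto-promotion
	Queue                []string // pending Promotions, identified here by Freight name
}

// EnqueueAuto models process 1: a Promotion is created for newly
// available Freight only if auto-promotion is enabled.
func (s *Stage) EnqueueAuto(freight string) bool {
	if !s.AutoPromotionEnabled {
		return false
	}
	s.Queue = append(s.Queue, freight)
	return true
}

// EnqueueManual models a manually created Promotion, which does not
// depend on the auto-promotion setting.
func (s *Stage) EnqueueManual(freight string) {
	s.Queue = append(s.Queue, freight)
}

// StartNext models process 2: pick the next queued Promotion to start.
// While paused, nothing starts, whether it was queued automatically or manually.
func (s *Stage) StartNext() (string, bool) {
	if s.PromotionsPaused || len(s.Queue) == 0 {
		return "", false
	}
	next := s.Queue[0]
	s.Queue = s.Queue[1:]
	return next, true
}

func main() {
	s := &Stage{AutoPromotionEnabled: true, PromotionsPaused: true}
	s.EnqueueAuto("freight-a")
	s.EnqueueManual("freight-b")

	_, started := s.StartNext()
	fmt.Println(started) // false: paused, so queued Promotions wait

	s.PromotionsPaused = false
	next, started := s.StartNext()
	fmt.Println(next, started) // freight-a true
}
```

Gating the second process rather than the first is what keeps already-queued Promotions from proceeding during an incident.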

jcantwell-JC avatar Apr 02 '25 02:04 jcantwell-JC

This issue has been automatically marked as stale because it had no activity for 90 days. It will be closed if no activity occurs in the next 30 days but can be reopened if it becomes relevant again.

github-actions[bot] avatar Jul 01 '25 11:07 github-actions[bot]

This issue has been automatically marked as stale because it had no activity for 90 days. It will be closed if no activity occurs in the next 30 days but can be reopened if it becomes relevant again.

github-actions[bot] avatar Oct 01 '25 11:10 github-actions[bot]

not stale, still needed

florianmutter avatar Oct 01 '25 13:10 florianmutter