Rollback windows for fast-tracked rollbacks

Open jessesuen opened this issue 3 years ago • 5 comments

Spawned from https://github.com/argoproj/argo-rollouts/issues/557.

Currently we perform a "fast-tracked rollback" (which skips pauses, steps, and analysis) in two circumstances:

  1. we are moving back to a blue-green ReplicaSet which exists and is still scaled up (within its scaleDownDelay)
  2. we are moving back to the canary's "stable" ReplicaSet and the upgrade has not yet completed.
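
As a rough illustration of the two conditions above, here is a minimal sketch in Python. It is not the controller's actual code: `RollbackTarget`, its fields, and `is_fast_tracked` are hypothetical names used only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass
class RollbackTarget:
    """Hypothetical snapshot of the ReplicaSet a rollout is moving back to."""
    exists: bool
    scaled_up: bool  # still running pods, i.e. within its scaleDownDelay
    is_stable: bool  # the canary strategy's current "stable" ReplicaSet


def is_fast_tracked(strategy: str, target: RollbackTarget, upgrade_completed: bool) -> bool:
    """Return True when a rollback may skip pauses, steps, and analysis."""
    if strategy == "blueGreen":
        # Case 1: the old blue-green ReplicaSet exists and is not scaled down yet.
        return target.exists and target.scaled_up
    if strategy == "canary":
        # Case 2: back to the stable ReplicaSet while the upgrade is still in progress.
        return target.is_stable and not upgrade_completed
    return False
```

Under this model, a blue-green ReplicaSet that has already been scaled down never qualifies, which is the gap the proposal below is meant to close.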

I think we should add more controls for intelligent, fast-tracked rollbacks for both the blue-green and canary strategies. For example, one use case is a fast-tracked rollback when moving to an older blue-green ReplicaSet even if it is scaled down (and not only while it is within its scaleDownDelay).

I think this could be either time-based:

spec:
  rollbackWindow:
    duration: 24h
  strategy:
    blueGreen:
...

Or it could be revision-based (e.g. fast-rollback if we are moving to an n-2 revision):

spec:
  rollbackWindow:
    revisions: 2
  strategy:
    canary:
...

I propose placing the rollbackWindow stanza outside strategy, since both blue-green and canary would benefit from it.
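
To make the two variants concrete, here is a minimal sketch in Python of how a rollbackWindow check might be evaluated. Everything in it is an assumption for illustration: the function name, the idea of measuring `duration` from when the target ReplicaSet was last active, and the rule that both limits must hold if both are set. None of this is settled by the proposal.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def within_rollback_window(
    current_revision: int,
    target_revision: int,
    target_last_active: datetime,
    now: datetime,
    duration: Optional[timedelta] = None,
    revisions: Optional[int] = None,
) -> bool:
    """Return True if moving to target_revision qualifies for a fast-tracked rollback."""
    if target_revision >= current_revision:
        return False  # not a rollback at all
    if duration is None and revisions is None:
        return False  # no rollbackWindow configured
    # Assumption: the time window is measured from when the target was last active.
    if duration is not None and now - target_last_active > duration:
        return False
    # Assumption: "revisions: 2" allows going back at most two revisions (n-1 or n-2).
    if revisions is not None and current_revision - target_revision > revisions:
        return False
    return True
```

For example, with `revisions=2` and a current revision of 5, a move to revision 3 would qualify and a move to revision 2 would not.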

jessesuen avatar Jul 08 '20 00:07 jessesuen

Have you considered the opposite situation? For example, an application may need to avoid starting multiple instances at the same time because the surrounding system cannot handle the load, yet we sometimes need to roll back during the canary. In that case, can we choose not to use the "fast-tracked rollback"?

bysph avatar Jan 06 '22 08:01 bysph

@jessesuen this would be a good addition to Argo Rollouts. Are we tracking this against any release? Thanks!

svissarapu avatar Jan 19 '22 20:01 svissarapu

Howdy! Is anyone working on this feature already? Will it be part of the next release?

pragmaticivan avatar May 08 '22 04:05 pragmaticivan

Hi y'all, is anyone working on this feature? I see it's been open for 2 years now.

It is in high demand in production environments, where during incidents we need to roll back quickly to the previous stable version. The pre-analysis and post-analysis runs were already performed during a previous rollout, so we do not need to verify once again that the previous version is stable.

Could you add a --full flag to "kubectl argo rollouts undo"?

$ kubectl argo rollouts promote --help | grep full
To skip analysis, pauses and steps entirely, use '--full' to fully promote the rollout
        kubectl argo rollouts promote guestbook --full
      --full   Perform a full promotion, skipping analysis, pauses, and steps

$ kubectl argo rollouts undo --help | grep full

The only way I can speed up the rollback now is to terminate the prePromotionAnalysis and postPromotionAnalysis runs. But that costs valuable time during the incident.

I think I could automate the three steps (undo, terminate pre, terminate post) based on the output of "kubectl argo rollouts get rollout". However, the feature would be beneficial for all users of Argo Rollouts.
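
A minimal sketch of that automation in Python, assuming the analysis run names are already known. The helper `rollback_commands` and the example run names are hypothetical. In practice the names only appear once the undo has started and would have to be read from the "get rollout" output, which this sketch does not attempt.

```python
from typing import List, Sequence


def rollback_commands(rollout: str, analysis_runs: Sequence[str]) -> List[List[str]]:
    """Build the CLI invocations for an undo followed by terminating analysis runs."""
    # Step 1: start the rollback to the previous revision.
    commands = [["kubectl", "argo", "rollouts", "undo", rollout]]
    # Steps 2 and 3: terminate each analysis run that would otherwise delay the rollback.
    for run in analysis_runs:
        commands.append(["kubectl", "argo", "rollouts", "terminate", "analysisrun", run])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in order. The terminate commands should only be issued after the corresponding analysis run actually exists.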

sys-ops avatar Jul 28 '22 15:07 sys-ops

One workaround we are using right now is to leave a timed pause step at the end of the Rollout. We manually terminate the analysis run so that the only way we will roll back is if someone manually triggers it.

This works, but it means that the "new" version is still marked as canary during that period, even though it is receiving 100% of traffic. I like the original idea proposed above, but even being able to set the scaleDownDelay to a long period after the rollout is complete, and allowing a quick rollback within it, would be helpful.

bpoland avatar Jul 29 '22 14:07 bpoland

@jessesuen we're also interested in this feature. I can take a stab at implementing it. Can you give me some hints on where this should go?

alexef avatar Oct 27 '22 10:10 alexef