Flagger: thundering herd causes canary to get stuck in Finalising

Describe the bug

When a canary for a deployment serving high RPS is promoted to primary, the new primary fails because it doesn't have enough replicas to handle the load.

To Reproduce

Apply a Canary object for a high-RPS deployment with stepWeight: 3 and interval: 1m (3% every minute).
Use a fairly large gap between the HPA's min and max, e.g. minReplicas: "3", maxReplicas: "30".
Observe the deployment
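
For reference, the setup above might look roughly like the manifests below. This is a minimal sketch, not the exact configuration from the affected cluster: the name my-app, the namespace prod, the service port, the threshold, maxWeight and the CPU target are all assumptions for illustration, and metrics and webhooks are omitted. The replica counts are written as integers here, as an HPA manifest expects.

```yaml
# Canary with a slow, fine-grained traffic shift (3% per minute).
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app              # hypothetical name
  namespace: prod           # hypothetical namespace
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  autoscalerRef:            # the HPA reference that the analysis in this issue hinges on
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: my-app
  service:
    port: 80                # assumed port
  analysis:
    interval: 1m
    threshold: 5            # assumed value
    maxWeight: 100          # assumed value
    stepWeight: 3
---
# HPA with a wide gap between min and max replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
```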

Expected behavior

The deployment should succeed

Actual behavior

The canary traffic shift completes successfully in around half an hour, since it moves 3% every minute.
The canary deployment is scaled up by the HPA as more traffic is shifted to it.
At the same time, the primary deployment is scaled down by the HPA for the same reason.
The canary is promoted to primary.
The new primary fails because it can't handle the load.
The canary is stuck in the Finalising state.

This is because the new primary deployment's replica count is only set when the HPA ref is nil. With an HPA configured, the new primary's replica count ends up at the HPA's min, and since that is a small value, the new primary cannot handle the load.

Additional context

Flagger version: 1.34
Kubernetes version: 1.27
Service Mesh provider: istio
Ingress provider: istio

Workarounds

Adjust stepWeightPromotion so that promotion does a partial traffic shift - Since this is already done as part of the canary analysis, it seems redundant.
Don't use a low value for the HPA min - This isn't ideal for workloads whose off-peak traffic is low, since it wastes resources.
If stepWeightPromotion: 100 (or via another variable like promotionReplicas), set the primary replicas to the canary replicas - This seems logical, but I'm not sure how the HPA will react.
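
As an illustration of the first workaround, the analysis section of the Canary could look something like the fragment below. It is only a sketch: the value 20 for stepWeightPromotion is a hypothetical choice, maxWeight is assumed, and the rest of the spec is omitted. With a value below 100, traffic should move back to the primary in increments rather than all at once, giving the HPA time to scale the primary up between steps.

```yaml
# Fragment of a Canary spec; only the relevant analysis fields are shown.
spec:
  analysis:
    interval: 1m
    stepWeight: 3
    maxWeight: 100            # assumed value
    stepWeightPromotion: 20   # hypothetical value; anything below 100 promotes gradually
```

The trade-off is extra promotion time: the smaller the step, the more intervals the promotion phase takes.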

Feb 21 '24 06:02 shysank

Use stepWeightPromotion to progressively shift traffic back to the primary and thus avoid the thundering herd.

Docs: https://docs.flagger.app/usage/deployment-strategies#canary-release

Feb 21 '24 07:02 stefanprodan

@stefanprodan Thanks for the response. stepWeightPromotion is what we're planning to use. Unfortunately, a side effect is that deployment time doubles for the same strategy without much benefit: we already know that the new build works, and even if it didn't, we cannot roll back to an older deployment at this point.

Would it make sense to set the new primary replicas to be the same as the canary's when stepWeightPromotion is 100? Or to have another variable like primaryReplicas? Happy to work on a patch if either of these makes sense.

Feb 21 '24 16:02 shysank