flagger
Thundering herd causes canary to be stuck in finalising
Describe the bug
When a canary for a deployment serving high RPS is promoted to primary, the new primary fails because it doesn't have enough replicas to handle the load.
To Reproduce
- Apply a canary object for a high RPS deployment with stepWeight: 3, interval: 1 (3% every min)
- Have a somewhat large difference between min and max for HPA. Eg. minReplicas: "3", maxReplicas: "30"
- Observe the deployment
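For reference, the setup described in these steps might look roughly like the manifests below. This is only a sketch: the name my-app, the namespace, the port, the CPU target, threshold and maxWeight are placeholders and assumptions that do not come from the report. The interval is written as 1m and the HPA bounds as plain integers.

```yaml
# HPA with a wide gap between min and max replicas (values from the report).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app            # placeholder name
  namespace: default      # placeholder namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, not from the report
---
# Canary shifting 3% of traffic every minute.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: my-app
  service:
    port: 8080            # placeholder port
  analysis:
    interval: 1m
    threshold: 5          # assumed value
    maxWeight: 100        # assumed, so the shift takes about half an hour
    stepWeight: 3
```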
Expected behavior
The deployment should succeed
Actual behavior
- Canary traffic shift happens successfully in around half an hour since it's 3% every minute
- Canary deployment is scaled up by hpa as more traffic is shifted.
- At the same time primary deployment is scaled down by the hpa for the same reason.
- Canary is promoted to primary.
- New primary fails because it can't handle the load.
- Canary is stuck in Finalising state
This is because the new primary deployment's replica count is only set when the HPA ref is nil. When an HPA is configured, the new primary's replica count is set to the HPA's min, and since this is a small value, it cannot handle the load.
Additional context
- Flagger version: 1.34
- Kubernetes version: 1.27
- Service Mesh provider: istio
- Ingress provider: istio
Workarounds
- Adjust stepWeightPromotion to make sure it does a partial traffic shift
  - Since this is already done as part of the canary, it seems redundant
- Don't have a low value for HPA min
  - This won't be ideal for workloads whose non-peak traffic is low, resulting in wasted resources
- If stepWeightPromotion: 100 (or have another variable like promotionReplicas), primary replicas should be set to canary replicas
  - This seems logical, but I'm not sure how the HPA will react.
Use stepWeightPromotion to progressively shift traffic back to the primary and thus avoiding thundering herd.
Docs: https://docs.flagger.app/usage/deployment-strategies#canary-release
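As a rough sketch of this suggestion, the analysis section of the Canary spec could set stepWeightPromotion below 100 so that traffic returns to the primary in increments during promotion. The value 25 and the surrounding fields are illustrative assumptions rather than values from this thread. The linked docs are the place to confirm the exact semantics for a given Flagger version.

```yaml
# Fragment of a Canary spec (spec.analysis only).
analysis:
  interval: 1m
  stepWeight: 3
  maxWeight: 100          # assumed value
  # Assumed value: during promotion, move traffic back to the primary
  # in 25% steps per interval instead of all at once.
  stepWeightPromotion: 25
```

With a setting like this, the HPA on the new primary has several intervals to scale up before it receives all of the traffic.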
@stefanprodan Thanks for the response. stepWeightPromotion is what we're planning to use. Unfortunately, a side effect is that deployment time doubles for the same strategy without much benefit, since we already know that the new build works, and even if it didn't, we cannot roll back to an older deployment at this point.
Would it make sense to set the new primary replicas to be the same as the canary's when stepWeightPromotion is 100? Or have another variable like primaryReplicas? Happy to work on a patch if either of these makes sense.