Canary is restarted to "Starting canary analysis for podinfo.test" on Confirm Traffic Increase fail
Describe the bug
Given
webhooks:
- name: confirm-traffic
url: http://flagger-loadtester.test/notfound # notice /notfound
timeout: 5s
type: confirm-traffic-increase
The canary is always restarted over to "Starting canary analysis"
Normal Synced 7m59s (x4 over 8m29s) flagger Starting canary analysis for podinfo.test
Warning Synced 7m59s (x4 over 8m29s) flagger Halt podinfo.test advancement waiting for traffic increase approval
Thus the pre-rollout webhook is executed repeatedly.
To Reproduce
- Follow installation step in https://docs.flagger.app/tutorials/gloo-progressive-delivery
- Use this canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: podinfo
namespace: test
spec:
provider: gloo
targetRef:
apiVersion: apps/v1
kind: Deployment
name: podinfo
autoscalerRef:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
name: podinfo
service:
port: 9898
targetPort: 9898
analysis:
interval: 10s
threshold: 5
maxWeight: 50
stepWeight: 5
webhooks:
- name: automata
url: http://flagger-loadtester.test/notfound
timeout: 5s
type: confirm-traffic-increase
Expected behavior
As the docs says,
confirm-traffic-increase hooks are executed right before the weight on the canary is increased. The canary advancement is paused until this hook returns HTTP 200.
I would expect the canary is paused without being restarted to pre-rollout again.
Additional context
- Flagger version: 1.22.1
- Kubernetes version: 1.24.0
- Service Mesh provider:
- Ingress provider: Gloo
Notes
If this is a bug, I would be happy contribute a fix. I think the problem here is
The runConfirmTrafficIncreaseHooks on failure returns with the same Phase: Progressing.
https://github.com/fluxcd/flagger/blob/f449ee18780d2496868596af8384da938dd9b9c1/pkg/controller/scheduler.go#L450-L452
Then in the next run, this condition is satisfied (canaryWeight: 0, status.Iterations: 0) https://github.com/fluxcd/flagger/blob/f449ee18780d2496868596af8384da938dd9b9c1/pkg/controller/scheduler.go#L401-L403
A possible solution could be to have a dedicated status for WaitingTrafficIncrease so that we can exclude this phase from restarting analysis, similar to CanaryPhaseWaitingPromotion.
CanaryPhaseWaitingTrafficIncrease CanaryPhase = "WaitingTrafficIncrease"
@andylibrian thanks for opening this issue and showing interest to contribute a fix, I'm assigning you this issue. Feel free to ask for any help/suggestions if needed, looking forward to your PR :)
@aryan9600 Sure. For the first step, I would like to get a comment / feedback from maintainers, whether what I described is a bug. Is my description on "Expected behavior" correct / agreed?
If so, is there any comment on my proposed solution? The one in the "Notes" section.
Yes, the expected behavior is for analysis to halt till the hooks return a HTTP 2xx response. Yes it seems like having a dedicated phase for this is the way yo go.