flagger icon indicating copy to clipboard operation
flagger copied to clipboard

Canary is restarted to "Starting canary analysis for podinfo.test" on Confirm Traffic Increase fail

Open andylibrian opened this issue 3 years ago • 3 comments

Describe the bug

Given

    webhooks:
      - name: confirm-traffic
        url: http://flagger-loadtester.test/notfound  # notice /notfound
        timeout: 5s
        type: confirm-traffic-increase

The canary is always restarted over to "Starting canary analysis"

  Normal   Synced  7m59s (x4 over 8m29s)  flagger  Starting canary analysis for podinfo.test
  Warning  Synced  7m59s (x4 over 8m29s)  flagger  Halt podinfo.test advancement waiting for traffic increase approval

Thus the pre-rollout webhook is executed repeatedly.

To Reproduce

  1. Follow installation step in https://docs.flagger.app/tutorials/gloo-progressive-delivery
  2. Use this canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: gloo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
  analysis:
    interval: 10s
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    webhooks:
      - name: automata
        url: http://flagger-loadtester.test/notfound
        timeout: 5s
        type: confirm-traffic-increase

Expected behavior

As the docs says,

confirm-traffic-increase hooks are executed right before the weight on the canary is increased. The canary advancement is paused until this hook returns HTTP 200.

I would expect the canary is paused without being restarted to pre-rollout again.

Additional context

  • Flagger version: 1.22.1
  • Kubernetes version: 1.24.0
  • Service Mesh provider:
  • Ingress provider: Gloo

Notes

If this is a bug, I would be happy contribute a fix. I think the problem here is

The runConfirmTrafficIncreaseHooks on failure returns with the same Phase: Progressing. https://github.com/fluxcd/flagger/blob/f449ee18780d2496868596af8384da938dd9b9c1/pkg/controller/scheduler.go#L450-L452

Then in the next run, this condition is satisfied (canaryWeight: 0, status.Iterations: 0) https://github.com/fluxcd/flagger/blob/f449ee18780d2496868596af8384da938dd9b9c1/pkg/controller/scheduler.go#L401-L403

A possible solution could be to have a dedicated status for WaitingTrafficIncrease so that we can exclude this phase from restarting analysis, similar to CanaryPhaseWaitingPromotion.

	CanaryPhaseWaitingTrafficIncrease CanaryPhase = "WaitingTrafficIncrease"

andylibrian avatar Aug 03 '22 04:08 andylibrian

@andylibrian thanks for opening this issue and showing interest to contribute a fix, I'm assigning you this issue. Feel free to ask for any help/suggestions if needed, looking forward to your PR :)

aryan9600 avatar Aug 03 '22 08:08 aryan9600

@aryan9600 Sure. For the first step, I would like to get a comment / feedback from maintainers, whether what I described is a bug. Is my description on "Expected behavior" correct / agreed?

If so, is there any comment on my proposed solution? The one in the "Notes" section.

andylibrian avatar Aug 03 '22 09:08 andylibrian

Yes, the expected behavior is for analysis to halt till the hooks return a HTTP 2xx response. Yes it seems like having a dedicated phase for this is the way yo go.

aryan9600 avatar Aug 03 '22 10:08 aryan9600