
Canary didn't time out waiting for canary pods

Open lcooper01 opened this issue 1 year ago • 0 comments

Describe the bug


We observed an issue where a canary never timed out and was stuck in the Progressing phase while waiting for its 2 canary pods to become ready. The canary pods started, but each pod only managed to start 3 of 4 containers, and both pods then went into CrashLoopBackOff.

The Flagger logs show the same message repeating for 2 hours, even though progressDeadlineSeconds is set to 180, so we expected the canary to fail after 3 minutes.

Canary definition

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  labels:
    app: service-name
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: version
    helm.sh/chart: service-name-version
    tags.datadoghq.com/service: service-name
    tags.datadoghq.com/version: version
  name: service-name
  namespace: cluster
spec:
  analysis:
    alerts:
    - name: service-name team-deployments
      providerRef:
        name: service-name-team-deployments
        namespace: flagger
      severity: info
    canaryReadyThreshold: 100
    interval: 30s
    maxWeight: 50
    metrics:
    - interval: 1m
      name: service-name-container-cpu-usage
      templateRef:
        name: service-name-container-cpu-usage
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 95
    - interval: 1m
      name: service-name-container-memory-usage
      templateRef:
        name: service-name-container-memory-usage
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 95
    - interval: 1m
      name: service-name-container-restarts
      templateRef:
        name: service-name-container-restarts
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 10
    - interval: 1m
      name: service-name-service-error-rate
      templateRef:
        name: service-name-service-error-rate
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 1
    primaryReadyThreshold: 100
    stepWeight: 10
    threshold: 5
    webhooks:
    - metadata:
        cmd: curl -s --fail --show-error http://service-name-canary.cluster:port//status/readiness;
          sleep 10
        type: bash
      name: acceptance-test
      timeout: 30s
      type: pre-rollout
      url: http://flagger-loadtester.flagger/
    - metadata:
        cmd: hey -z 1m -q 10 -c 2 http://service-name-canary.cluster:port//status/readiness
      name: load-test
      timeout: 15s
      type: rollout
      url: http://flagger-loadtester.flagger/
    - name: events
      type: event
      url: http://flagger-metric-consumer.flagger.svc.cluster.local:port/metrics
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: service-name
    primaryScalerReplicas:
      maxReplicas: 8
      minReplicas: 2
  progressDeadlineSeconds: 180
  revertOnDeletion: false
  service:
    name: service-name
    port: port
    portDiscovery: true
    timeout: 600s
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  skipAnalysis: false
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-name

Describe output for the canary

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  52m                    flagger  canary deployment service-name.namespace not ready: waiting for rollout to finish: 1 out of 2 new replicas have been updated
  Warning  Synced  52m                    flagger  canary deployment service-name.namespace not ready: waiting for rollout to finish: 1 out of 2 new replicas have been updated
  Warning  Synced  2m25s (x171 over 18h)  flagger  canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available
  Warning  Synced  2m25s (x171 over 18h)  flagger  canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available

Flagger logs (repeating the same messages for 2 hours)

{"level":"info","ts":"2024-01-11T11:09:45.125Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}
{"level":"info","ts":"2024-01-11T11:10:15.187Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 1 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}
{"level":"info","ts":"2024-01-11T11:10:45.178Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}

To Reproduce

1. Introduce a change that stops a pod from fully starting
2. Wait for the pods to go into CrashLoopBackOff
3. Observe that the timeout period is exceeded without the canary being failed
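For example, a change like the sketch below is enough to drive the canary pods into CrashLoopBackOff (the container name, image tag, and command here are purely illustrative, not from our actual workload):

```yaml
# Illustrative only: a command that exits immediately makes the
# container crash-loop, so the canary pods never become ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-name
spec:
  template:
    spec:
      containers:
      - name: app                          # hypothetical container name
        image: app:broken                  # hypothetical image tag
        command: ["sh", "-c", "exit 1"]    # fails on startup
```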

Expected behavior

We expected Flagger to fail the canary once the progress deadline had been exceeded, which in this case was 3 minutes.
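As a minimal sketch of the check we expected (this is not Flagger's actual implementation, just the behavior implied by progressDeadlineSeconds):

```python
from datetime import datetime, timedelta

def deadline_exceeded(last_progress: datetime, now: datetime,
                      progress_deadline_seconds: int) -> bool:
    """Expected behavior: fail the canary once the time since the last
    rollout progress exceeds progressDeadlineSeconds."""
    return now - last_progress > timedelta(seconds=progress_deadline_seconds)

# With progressDeadlineSeconds: 180, a rollout stuck for 2 hours
# (as in the logs above) should long since have been failed.
start = datetime(2024, 1, 11, 11, 9, 45)
print(deadline_exceeded(start, start + timedelta(hours=2), 180))  # True
```

Instead, the canary stayed in Progressing and the same "not ready" message repeated every 30 seconds.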

Additional context

  • Flagger version: 1.34.0
  • Kubernetes version: 1.25.16
  • Service Mesh provider: Istio
  • Ingress provider: Istio

lcooper01 · Jan 11 '24 16:01