Canary didn't timeout waiting for canary pods
Describe the bug
We observed an issue where a canary never timed out and remained stuck in Progressing while waiting for the 2 canary pods to become ready. The canary pods started, but only 3/4 containers on each pod came up, and both pods went into CrashLoopBackOff.
The Flagger logs show the same message repeating for 2 hours, even though progressDeadlineSeconds is set to 180, so we expected the canary to fail after 3 minutes.
Canary definition
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  labels:
    app: service-name
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: version
    helm.sh/chart: service-name-version
    tags.datadoghq.com/service: service-name
    tags.datadoghq.com/version: version
  name: service-name
  namespace: cluster
spec:
  analysis:
    alerts:
    - name: service-name team-deployments
      providerRef:
        name: service-name-team-deployments
        namespace: flagger
      severity: info
    canaryReadyThreshold: 100
    interval: 30s
    maxWeight: 50
    metrics:
    - interval: 1m
      name: service-name-container-cpu-usage
      templateRef:
        name: service-name-container-cpu-usage
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 95
    - interval: 1m
      name: service-name-container-memory-usage
      templateRef:
        name: service-name-container-memory-usage
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 95
    - interval: 1m
      name: service-name-container-restarts
      templateRef:
        name: service-name-container-restarts
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 10
    - interval: 1m
      name: service-name-service-error-rate
      templateRef:
        name: service-name-service-error-rate
        namespace: flagger
      templateVariables:
        clusterName: cluster
      thresholdRange:
        max: 1
    primaryReadyThreshold: 100
    stepWeight: 10
    threshold: 5
    webhooks:
    - metadata:
        cmd: curl -s --fail --show-error http://service-name-canary.cluster:port//status/readiness;
          sleep 10
        type: bash
      name: acceptance-test
      timeout: 30s
      type: pre-rollout
      url: http://flagger-loadtester.flagger/
    - metadata:
        cmd: hey -z 1m -q 10 -c 2 http://service-name-canary.cluster:port//status/readiness
      name: load-test
      timeout: 15s
      type: rollout
      url: http://flagger-loadtester.flagger/
    - name: events
      type: event
      url: http://flagger-metric-consumer.flagger.svc.cluster.local:port/metrics
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: service-name
    primaryScalerReplicas:
      maxReplicas: 8
      minReplicas: 2
  progressDeadlineSeconds: 180
  revertOnDeletion: false
  service:
    name: service-name
    port: port
    portDiscovery: true
    timeout: 600s
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  skipAnalysis: false
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-name
kubectl describe output for the canary
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Synced 52m flagger canary deployment service-name.namespace not ready: waiting for rollout to finish: 1 out of 2 new replicas have been updated
Warning Synced 2m25s (x171 over 18h) flagger canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available
Flagger logs (repeating the same messages for 2 hours):
{"level":"info","ts":"2024-01-11T11:09:45.125Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}
{"level":"info","ts":"2024-01-11T11:10:15.187Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 1 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}
{"level":"info","ts":"2024-01-11T11:10:45.178Z","caller":"controller/events.go:45","msg":"canary deployment service-name.namespace not ready: waiting for rollout to finish: 0 of 2 (readyThreshold 100%) updated replicas are available","canary":"service-name.namespace"}
To Reproduce
1. Introduce a change that stops a pod from fully starting.
2. Wait for the pods to go into CrashLoopBackOff.
3. Observe that the timeout period is exceeded without the canary failing.
Expected behavior
We expected Flagger to fail the canary once progressDeadlineSeconds had been exceeded, which in this case was 180 seconds (3 minutes).
Additional context
- Flagger version: 1.34.0
- Kubernetes version: 1.25.16
- Service Mesh provider: Istio
- Ingress provider: Istio