Canary status stuck in WaitingPromotion
Canary status stuck in WaitingPromotion for a long duration.
The canary has been stuck in WaitingPromotion for hours with the message "Halt sampleapp.testnamespace advancement waiting for promotion approval pre-rollout". In the canary manifest we set a webhook timeout of 2-3 minutes, so if the webhook doesn't return a 200 response within that time we expect the canary to be marked as failed. Instead, the canary stays in WaitingPromotion.
I have tried webhooks of type confirm-promotion and pre-rollout for this test, and in both cases the status remains stuck in WaitingPromotion.
To Reproduce
Deploy a new change. Once the canary load test is successful and the webhook returns 200, the changes are rolled out to the primary pods (this is working). If the webhook doesn't return 200 within a certain period of time (a 2m timeout in our case), the canary should time out and be marked as failed (this is not working).
Below is the sample canary yaml file used
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: sampleapp-sampleapp
  namespace: testnamespace
spec:
  analysis:
    interval: 1m
    maxWeight: 40
    metrics:
    - interval: 15s
      name: 2xx 3xx percentage
      templateRef:
        name: sampleapp
        namespace: testnamespace
      thresholdRange:
        min: 80
    stepWeight: 10
    threshold: 3
    webhooks:
    - metadata:
        type: canary-deployment
      name: pre-rollout
      timeout: 2m
      type: confirm-promotion
      url: <API endpoint which returns 200 if exist>
    - metadata:
        cmd: >-
          hey -z 20m -q 10 -c 2
          <sampleapp endpoint>
      name: load-test
      timeout: 5s
      url: http://flagger-loadtester.flagger/
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: sampleapp-sampleapp
  progressDeadlineSeconds: 600
  service:
    appProtocol: TCP
    gateways:
    - default/cobalt-ingressgateway
    headers:
      response:
        set:
          Strict-Transport-Security: max-age=31536000; includeSubDomains
    match:
    - uri:
        prefix: /sampleapp-qa/
    - uri:
        prefix: /sampleapp-sampleapp-qa/
    - uri:
        prefix: /sampleapp-qa/
    name: sampleapp-sampleapp
    port: 80
    portDiscovery: true
    portName: sampleapp-port
    rewrite:
      uri: /
    targetPort: 80
    timeout: 10s
    trafficPolicy:
      tls:
        mode: DISABLE
  skipAnalysis: false
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sampleapp-sampleapp
Below is the canary event
Normal Synced 28m (x2 over 3d22h) flagger New revision detected! Restarting analysis for sampleapp-sampleapp.testnamespace
Warning Synced 24m (x4 over 27m) flagger canary deployment sampleapp-sampleapp.testnamespace not ready: waiting for rollout to finish: 1 old replicas are pending termination
Normal Synced 23m (x5 over 3d22h) flagger Starting canary analysis for sampleapp-sampleapp.testnamespace
Normal Synced 23m (x5 over 3d22h) flagger Advance sampleapp-sampleapp.testnamespace canary weight 10
Normal Synced 22m (x5 over 3d22h) flagger Advance sampleapp-sampleapp.testnamespace canary weight 20
Normal Synced 21m (x5 over 3d22h) flagger Advance sampleapp-sampleapp.testnamespace canary weight 30
Normal Synced 20m (x5 over 3d22h) flagger Advance sampleapp-sampleapp.testnamespace canary weight 40
Warning Synced 19m flagger Halt sampleapp-sampleapp.testnamespace advancement waiting for promotion approval pre-rollout
Expected behavior
If the webhook doesn't return 200 within the configured timeout, the canary should be marked as failed.
Can someone provide input on this issue? Is this expected behavior, is something in my config wrong, or does this need to be fixed in the code?
This is expected behavior. From https://fluxcd.io/flagger/usage/webhooks/:
confirm-promotion hooks are executed before the promotion step. The canary promotion is paused until the hooks return HTTP 200. While the promotion is paused, Flagger will continue to run the metrics checks and rollout hooks.
If you want to roll back, add another webhook of type rollback and have the webhook server return a 2xx status code once the canary has been stuck in WaitingPromotion for longer than your desired timeout.
@aryan9600 If the canary is stuck in the Promoting state, how do I make it fail?
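For illustration, here is a minimal sketch of how the webhooks section of the manifest above might look with such a hook added. The hook name promotion-timeout and the promotion-gate URL are hypothetical: they stand for a service you would run yourself that tracks how long the canary has been waiting and starts answering 2xx once your deadline has passed. As far as I understand, the timeout field on a webhook only bounds the individual HTTP request, so it does not act as a deadline for the promotion gate itself.

```yaml
# Sketch only, not a drop-in config.
# "promotion-timeout" and the promotion-gate URL are hypothetical placeholders.
analysis:
  webhooks:
  - name: pre-rollout
    type: confirm-promotion
    timeout: 2m          # bounds each HTTP call, not the total waiting time
    url: <API endpoint which returns 200 if exist>
  - name: promotion-timeout
    type: rollback
    # Your service should return a non-2xx status while the canary may keep
    # waiting, and a 2xx status once the desired deadline has passed.
    url: http://promotion-gate.testnamespace/rollback/check
```

With something like this in place, Flagger should keep the promotion paused while the confirm-promotion hook fails, and stop the analysis and mark the canary as failed as soon as the rollback hook returns a 2xx response.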