
Flagger ignores Datadog error thresholds, deploys failing versions

Open GurayCetin opened this issue 7 months ago • 0 comments

I'm testing a failure scenario by intentionally deploying a broken version of my application. According to my metric template and canary configuration, I expect Flagger to:

  • Detect the high error rates (visible in Datadog query results)
  • Fail the canary analysis checks
  • Halt the rollout of the broken version

Current Behavior: Despite clear evidence in Datadog showing error rates significantly above my configured threshold (1.1), Flagger:

  • Shows incorrect metric values (flagger_canary_metric_analysis = 1)
  • Proceeds with deploying the broken version
  • Provides no visibility into how it arrived at this decision (no query results in controller logs)

Debugging Observations:

  • Datadog Verification: Raw query results show values up to 20 (far exceeding the 1.1 threshold)
  • Flagger Metrics: Internal metric shows 1 (which doesn't match Datadog observations)
  • Logging Gap: Controller logs show the executed query but not the actual returned values
  • Behavior: Canary progresses when it should fail

I don't see any query result in the Flagger controller logs that would explain the pass/fail decision. The controller only prints the query that it runs against Datadog:

{"level":"debug","ts":"2025-07-29T11:38:08.241Z","caller":"controller/scheduler_metrics.go:309","msg":"Metric template error-rate.buy query: clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard}.as_count(),\n 0.05\n) / clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard-primary}.as_count(),\n 0.05\n)","canary":"giftcard.buy"}

  • Canary configuration for analysis. The metric template is expected to return 1 when canary and primary behave the same; if the canary's error rate is more than 10% higher than the primary's, I expect the check to fail.
  analysis:
    interval: 3m
    threshold: 5
    maxWeight: 50
    stepWeights: [20, 50]
    metrics:
      - name: "error-rate"
        templateRef:
          name: "error-rate"
          namespace: buy
        interval: 3m
        thresholdRange:
          max: 1.1  # 10% higher than 1
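
The expectation behind this configuration can be sketched in Python. This is an interpretation, not Flagger's actual code: it assumes that a check fails when the metric value exceeds `thresholdRange.max`, and that the canary is rolled back once the number of failed checks reaches `threshold`. The function names are hypothetical.

```python
from typing import Iterable, Optional


def check_passes(value: float, max_allowed: float = 1.1) -> bool:
    """Assumed semantics of thresholdRange.max: values above max fail."""
    return value <= max_allowed


def checks_until_rollback(
    values: Iterable[float], max_allowed: float = 1.1, threshold: int = 5
) -> Optional[int]:
    """Return the 1-based index of the check that triggers a rollback.

    Assumes failed checks accumulate across the analysis run and that
    reaching `threshold` failures aborts the rollout. Returns None if the
    rollout would never be aborted for the given metric values.
    """
    failed = 0
    for index, value in enumerate(values, start=1):
        if not check_passes(value, max_allowed):
            failed += 1
        if failed >= threshold:
            return index
    return None


# A metric that stays at 20 should exhaust the failure budget after
# `threshold` checks; a metric that stays at 1 should never do so.
print(checks_until_rollback([20.0] * 10))
print(checks_until_rollback([1.0] * 10))
```

Under these assumptions, with `interval: 3m` and `threshold: 5`, a metric that is consistently above 1.1 would need several failed checks before the rollout is aborted, but it should not advance to the next step weight in the meantime.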
  • Metric template:

The query computes canary_error_rate / primary_error_rate. Each error rate is passed through clamp_min with 0.05, so any rate under 5% is raised to 0.05. When both canary and primary are below 5%, the result is therefore always 1 and the canary is promoted.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: datadog
    address: "https://api.datadoghq.eu"
    secretRef:
      name: datadog
  query: |-
    clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }} AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}}.as_count(),
      0.05
    ) / clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }}-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}-primary}.as_count(),
      0.05
    )
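
The ratio this template is meant to compute can be sketched in plain Python. This only illustrates the arithmetic and is not Flagger's or Datadog's implementation: `error_rate_ratio` is a hypothetical helper, and `clamp_min` here simply mimics the Datadog function of the same name by raising any value below the floor to the floor.

```python
def clamp_min(value: float, floor: float) -> float:
    """Raise any value below `floor` to `floor` (mimics Datadog's clamp_min)."""
    return max(value, floor)


def error_rate_ratio(
    canary_rate: float, primary_rate: float, floor: float = 0.05
) -> float:
    """Canary error rate divided by primary error rate, both floored.

    Rates are fractions in [0, 1], i.e. errors / total requests.
    """
    return clamp_min(canary_rate, floor) / clamp_min(primary_rate, floor)


# Both rates under 5%: both are floored to 0.05, so the ratio is 1.
print(error_rate_ratio(0.01, 0.02))

# Canary fails every request while the primary is healthy: the ratio
# reaches its upper bound of 1 / floor.
print(error_rate_ratio(1.0, 0.0))

# Canary at 10% errors, primary at 5%: the ratio is about 2, above 1.1.
print(error_rate_ratio(0.10, 0.05))
```

With the floor at 0.05 the ratio cannot exceed 1 / 0.05 = 20, which would be consistent with the values of up to 20 seen in Datadog if the canary fails nearly every request while the primary stays under 5%.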
  • The query result in Datadog reaches values of up to 20. (Image)

  • flagger_canary_metric_analysis has the value 1. I'm not sure why it is 1, since that differs from the query result. (Image)

GurayCetin, Jul 29 '25 12:07