kustomize-controller

`gotk_resource_info{ready="Unknown"}` during the reconciliation

Open taraspos opened this issue 6 months ago • 2 comments

Problem

The gotk_resource_info metric has the label ready="Unknown" while a DaemonSet that takes ~1h to roll out is being reconciled.

This triggers a Prometheus alert for the resource being in a non-ready state for a prolonged period, even though the rollout is progressing successfully. See:

status:
  conditions:
  - lastTransitionTime: "2025-05-23T14:30:17Z"
    message: Running health checks for <redacted>
      with a timeout of 1h0m0s
    observedGeneration: 20
    reason: Progressing
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2025-05-23T14:30:16Z"
    message: Reconciliation in progress
    observedGeneration: 20
    reason: Progressing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-05-23T14:30:17Z"
    message: Running health checks for revision <redacted>
      with a timeout of 1h0m0s
    observedGeneration: 20
    reason: Progressing
    status: Unknown
    type: Healthy

Metric:

gotk_resource_info{
  ... <redacted> ...
  customresource_group="kustomize.toolkit.fluxcd.io",
  customresource_kind="Kustomization",
  customresource_version="v1",
  ready="Unknown",
  suspended="false",
  source_name="flux-system"
  ... <redacted> ...
  }

Expected behaviour

The metric has a label such as ready="Progressing", so that the alert can be configured to ignore resources that are still progressing.

Configuration

  • Kustomization:

    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: <redacted>
      namespace: <redacted>
    spec:
      interval: 10m
      serviceAccountName: kustomize-controller
      sourceRef:
        kind: GitRepository
        name: flux-system
      path: <redacted>
      prune: true
      wait: true
      suspend: false
      timeout: 60m
      dependsOn: <redacted>
    
  • Alertmanager alert

            - alert: FluxCDResourceNotReady
              expr: gotk_resource_info{ready!="True"} > 0
              for: 15m
    

taraspos, May 23 '25 15:05

You can change the metrics as you like in the kube-state-metrics config. If you prefer the reason instead of the status for ready, change the config to:

ready: [ status, conditions, "[type=Ready]", reason ]

https://github.com/fluxcd/flux2-monitoring-example/blob/4b0f96da1541309240b02a1e3e1116d93cb3e6d9/monitoring/controllers/kube-prometheus-stack/kube-state-metrics-config.yaml#L51
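For context, here is a rough sketch of where that line sits in the kube-state-metrics custom resource state config. It is modelled on the linked example from memory, so treat the surrounding keys (help text, other labels, the Helm values nesting, which is omitted) as assumptions and check them against the linked file.

```yaml
# Sketch of one entry under the kube-state-metrics customResourceState
# config (spec.resources). Only the `ready` line comes from the comment
# above; everything else is assumed to match the linked example.
- groupVersionKind:
    group: kustomize.toolkit.fluxcd.io
    version: v1
    kind: Kustomization
  metricNamePrefix: gotk
  metrics:
    - name: "resource_info"
      help: "The current state of a Flux Kustomization resource."
      each:
        type: Info
        info:
          labelsFromPath:
            name: [ metadata, name ]
      labelsFromPath:
        exported_namespace: [ metadata, namespace ]
        # Take the Ready condition's reason (e.g. "Progressing")
        # instead of its status ("True" / "False" / "Unknown").
        ready: [ status, conditions, "[type=Ready]", reason ]
        suspended: [ spec, suspend ]
        source_name: [ spec, sourceRef, name ]
```

With this change the `ready` label no longer carries "True", so an alert expression such as `gotk_resource_info{ready!="True"}` would have to be rewritten to match on reason values instead.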

Or you can add a new metric for healthy and compare the two.
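A related variant, which is not literally what is proposed above: keep `ready` as the condition status and expose the reason as an additional label, so the existing alert only needs one more matcher. The label name `ready_reason` is hypothetical. The values "True" and "Progressing" are the ones shown in the issue.

```yaml
# kube-state-metrics labelsFromPath (fragment): keep the status, add the reason.
# `ready_reason` is a hypothetical label name, not part of the upstream example.
labelsFromPath:
  ready: [ status, conditions, "[type=Ready]", status ]
  ready_reason: [ status, conditions, "[type=Ready]", reason ]
---
# Prometheus alert rule (fragment): skip resources that are still progressing.
- alert: FluxCDResourceNotReady
  expr: gotk_resource_info{ready!="True", ready_reason!="Progressing"} > 0
  for: 15m
```

A rollout that never finishes should still alert under this sketch, assuming the controller replaces the "Progressing" reason with a failure reason once the configured timeout expires.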

stefanprodan, May 23 '25 15:05

Thanks a lot for the quick response. I will take a look at the provided example!

taraspos, May 23 '25 15:05