
Duplicate samples for customResourceState metrics

Open · speer opened this issue 1 year ago

What happened:

We upgraded to Prometheus 2.52 and started receiving the following warnings:

ts=2024-07-11T06:43:56.289Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/k8s-monitoring/kube-prometheus-stack-kube-state-metrics/0 target=http://x.x.x.x:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=23
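
The dropped samples are also counted in Prometheus' own server metrics, so the extent of the problem can be tracked over time. A minimal query, assuming the standard Prometheus 2.x counter (this is a server-wide total, not broken down per target):

rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0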

We found similar open issues about duplicates, but this one is about all of the metrics configured via the customResourceState config.

After a fresh restart of the kube-state-metrics pod, the metrics are not duplicated. However, after a while, each metric configured via customResourceState is suddenly present two or even more times:

# There is exactly 1 kind: HelmRepository
$ kubectl get helmrepositories.source.toolkit.fluxcd.io -n flux-system
NAME     URL                     AGE
acraks   oci://xxxx.azurecr.io   8d

# After kube-state-metrics has been running for a while, it returns the exact same metric 3 times
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1

# After a restart of kube-state-metrics, there are no duplicates for a while
$ kubectl delete pod kube-prometheus-stack-kube-state-metrics-76968f786b-z7m8t
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
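
A rough way to list all affected series straight from the metrics endpoint (the sed strips the trailing sample value so that series differing only in value are caught as well; this assumes the standard Prometheus text exposition format without timestamps):

$ curl -s http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics \
    | grep -v '^#' | sed 's/ [^ ]*$//' | sort | uniq -cd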

What you expected to happen:

No duplicates, as the resource exists just once and all labels are the same.

How to reproduce it (as minimally and precisely as possible):

Use the configuration provided at https://fluxcd.io/flux/monitoring/custom-metrics, or the customResourceState config below:

apiVersion: v1
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
      - groupVersionKind:
          group: source.toolkit.fluxcd.io
          kind: HelmRepository
          version: v1
        metricNamePrefix: gotk
        metrics:
        - each:
            info:
              labelsFromPath:
                name:
                - metadata
                - name
            type: Info
          help: The current state of a Flux HelmRepository resource.
          labelsFromPath:
            exported_namespace:
            - metadata
            - namespace
            ready:
            - status
            - conditions
            - '[type=Ready]'
            - status
            revision:
            - status
            - artifact
            - revision
            suspended:
            - spec
            - suspend
            url:
            - spec
            - url
          name: resource_info
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.12.0
    helm.sh/chart: kube-state-metrics-5.20.0
    helm.toolkit.fluxcd.io/name: kube-prometheus-stack
    helm.toolkit.fluxcd.io/namespace: flux-system
    release: kube-prometheus-stack
  name: kube-prometheus-stack-kube-state-metrics-customresourcestate-config
  namespace: k8s-monitoring
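
For completeness: this ConfigMap is rendered from the kube-state-metrics subchart's customResourceState values. In kube-prometheus-stack the wiring looks roughly like the sketch below (key names taken from the prometheus-community charts and may differ between chart versions; the rbac.extraRules block is needed so KSM is allowed to list/watch the custom resources):

kube-state-metrics:
  customResourceState:
    enabled: true
    config:
      kind: CustomResourceStateMetrics
      spec:
        resources:
        - groupVersionKind:
            group: source.toolkit.fluxcd.io
            kind: HelmRepository
            version: v1
          # ... metrics as in the ConfigMap above
  rbac:
    extraRules:
    - apiGroups: ["source.toolkit.fluxcd.io"]
      resources: ["helmrepositories"]
      verbs: ["list", "watch"]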

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.12.0
  • Kubernetes version (use kubectl version): 1.29.4
  • Cloud provider or hardware configuration: AKS

speer · Jul 12 '24 12:07

I can confirm this. The problem only occurs after some time.

fischerman · Jul 17 '24 08:07

Can confirm this bug is still present. Since this application is bundled with kube-prometheus-stack, it would be nice to get an update. There is even a PR that was closed by the bot rather than merged.

Toasterson · Aug 08 '24 15:08

/assign @rexagod
/triage accepted

dgrisonnet · Aug 08 '24 16:08

Just confirming that this is an issue; the PR looks like a promising and sorely needed fix. KSM metrics output is invalid after CR updates, which is quite severe for us.

Thanks for already bringing up a PR \o/

m3co-code · Sep 25 '24 12:09

Hi @speer, is this issue fixed for you? We are running KSM v2.15.0, which contains this fix, but we are still getting duplicated samples.

Configuration used:

apiVersion: v1
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
...
      - errorLogV: 0
        groupVersionKind:
          group: serverless.kyma-project.io
          kind: Function
          version: v1alpha2
        labelsFromPath:
          name:
          - metadata
          - name
          namespace:
          - metadata
          - namespace
        metrics:
        - commonLabels:
            type: ConfigurationReady
          each:
            gauge:
              labelsFromPath:
                reason:
                - reason
              nilIsZero: true
              path:
              - status
              - conditions
              - '[type=ConfigurationReady]'
              valueFrom:
              - status
            type: Gauge
          help: function condition
          name: function_condition

JackCheng01 · Apr 03 '25 02:04

Hi @JackCheng01, the issue did not re-occur on our side.

speer · Apr 03 '25 09:04