Duplicate samples for customResourceState metrics
What happened:
We upgraded to Prometheus 2.52 and started receiving the following warnings:
ts=2024-07-11T06:43:56.289Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/k8s-monitoring/kube-prometheus-stack-kube-state-metrics/0 target=http://x.x.x.x:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=23
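For cross-checking on the Prometheus side, these drops are also counted in prometheus_target_scrapes_sample_duplicate_timestamp_total. A minimal sketch (the service name in the port-forward is an assumption for a default kube-prometheus-stack install):
# Assumption: Prometheus is reachable locally via a port-forward
$ kubectl -n k8s-monitoring port-forward svc/kube-prometheus-stack-prometheus 9090
# Query the counter of samples rejected due to "different value but same timestamp"
$ curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0'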
We found similar open issues about duplicates, but this one affects all metrics configured via the customResourceState config.
After a fresh restart of the kube-state-metrics pod, the metrics are not duplicated. However, after a while, each metric configured via customResourceState is suddenly present twice or even more often:
# There is exactly 1 kind: HelmRepository
$ kubectl get helmrepositories.source.toolkit.fluxcd.io -n flux-system
NAME     URL                     AGE
acraks   oci://xxxx.azurecr.io   8d
# After kube-state-metrics has been running for a while, it returns the exact same series 3 times
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
# After a restart of kube-state-metrics, there are no duplications for a while
$ kubectl delete pod kube-prometheus-stack-kube-state-metrics-76968f786b-z7m8t
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
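To spot the duplicated series without eyeballing the output, counting identical lines on the metrics endpoint works as well (a sketch, using the same service address as the curl commands above):
# Print only series that appear more than once, with their occurrence count
$ curl -s http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep -v '^#' | sort | uniq -cd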
What you expected to happen:
No duplicates, as the resource exists just once and all labels are the same.
How to reproduce it (as minimally and precisely as possible):
Use the configuration provided here: https://fluxcd.io/flux/monitoring/custom-metrics, or use the customResourceState config below:
apiVersion: v1
data:
  config.yaml: |
    spec:
      resources:
        - groupVersionKind:
            group: source.toolkit.fluxcd.io
            kind: HelmRepository
            version: v1
          metricNamePrefix: gotk
          metrics:
            - each:
                info:
                  labelsFromPath:
                    name:
                      - metadata
                      - name
                type: Info
              help: The current state of a Flux HelmRepository resource.
              labelsFromPath:
                exported_namespace:
                  - metadata
                  - namespace
                ready:
                  - status
                  - conditions
                  - '[type=Ready]'
                  - status
                revision:
                  - status
                  - artifact
                  - revision
                suspended:
                  - spec
                  - suspend
                url:
                  - spec
                  - url
              name: resource_info
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.12.0
    helm.sh/chart: kube-state-metrics-5.20.0
    helm.toolkit.fluxcd.io/name: kube-prometheus-stack
    helm.toolkit.fluxcd.io/namespace: flux-system
    release: kube-prometheus-stack
  name: kube-prometheus-stack-kube-state-metrics-customresourcestate-config
  namespace: k8s-monitoring
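Not strictly part of the repro, but a sketch of how to double-check which customResourceState config the running deployment actually uses (deployment and ConfigMap names are taken from the manifests above; kube-state-metrics loads the file via its --custom-resource-state-config-file flag):
# Show the custom-resource-state flags the container is started with
$ kubectl -n k8s-monitoring get deploy kube-prometheus-stack-kube-state-metrics -o yaml | grep -- --custom-resource-state
# Dump the config that is actually stored in the ConfigMap
$ kubectl -n k8s-monitoring get configmap kube-prometheus-stack-kube-state-metrics-customresourcestate-config -o jsonpath='{.data.config\.yaml}'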
Anything else we need to know?:
Environment:
- kube-state-metrics version: 2.12.0
- Kubernetes version (use kubectl version): 1.29.4
- Cloud provider or hardware configuration: AKS
I can confirm this. The problem only occurs after some time.
Can confirm this bug is still present. Since this application is bundled with kube-prometheus-stack, it would be nice to get an update. There is even a PR that was closed by the bot rather than merged.
/assign @rexagod
/triage accepted
Just confirming that this is an issue and the PR looks like a promising and direly needed fix. The KSM metrics output is invalid after CR updates, which is quite severe for us.
Thanks for already bringing up a PR \o/
Hi @speer, is this issue fixed for you? We are running KSM v2.15.0, which contains this fix, but we are still getting duplicated samples.
Configuration used:
apiVersion: v1
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
        ...
        - errorLogV: 0
          groupVersionKind:
            group: serverless.kyma-project.io
            kind: Function
            version: v1alpha2
          labelsFromPath:
            name:
              - metadata
              - name
            namespace:
              - metadata
              - namespace
          metrics:
            - commonLabels:
                type: ConfigurationReady
              each:
                gauge:
                  labelsFromPath:
                    reason:
                      - reason
                  nilIsZero: true
                  path:
                    - status
                    - conditions
                    - '[type=ConfigurationReady]'
                  valueFrom:
                    - status
                type: Gauge
              help: function condition
              name: function_condition
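A sketch of the same duplicate check against this config (the service address is a placeholder; grepping for the metric name defined above avoids assumptions about the metric name prefix):
# Print only series that appear more than once for the Function condition metric
$ curl -s http://<ksm-service>:8080/metrics | grep function_condition | sort | uniq -cd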
Hi @JackCheng01, the issue did not re-occur on our side.