PodMonitor linkerd-proxy - produces duplicate samples with the same timestamp
What is the issue?
When using something like Mimir for long-term metric retention, the PodMonitor's metrics are scraped by Prometheus and remote-written directly to Mimir. Mimir then rejects a large number of these metrics with the following error:
failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested
Mimir Docs: https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-sample-duplicate-timestamp
From the linked runbook: "Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample."
Disabling this PodMonitor stops the errors.
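To see which relabel rules are actually in play, these are the commands I would start from (a sketch only: the PodMonitor namespace/name are inferred from the `job="linkerd/linkerd-proxy"` label in the log below, and the secret name from kube-prometheus-stack defaults, so adjust to your install):

```sh
# Show the relabelings the linkerd-proxy PodMonitor declares
# (namespace/name inferred from the job label in the rejected series)
kubectl -n linkerd get podmonitor linkerd-proxy -o yaml

# Dump the scrape config prometheus-operator rendered from it, to compare the
# post-relabel label sets of two colliding targets
# (secret name follows the prometheus-operator convention prometheus-<Prometheus CR name>)
kubectl -n monitoring get secret prometheus-kube-prometheus-stack-prometheus \
  -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip \
  | grep -A40 'linkerd-proxy'
```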
How can it be reproduced?
- Install Prometheus (kube-prometheus-stack in our case)
- Install Mimir
- Install Linkerd with its linkerd-proxy PodMonitor enabled
- Configure Prometheus to remote-write to Mimir (a minimal sketch of the wiring follows this list)
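A minimal sketch of the remote-write wiring, assuming kube-prometheus-stack Helm values (which matches the labels in the log) and the Mimir nginx endpoint from the error below; everything else is left at defaults:

```yaml
# kube-prometheus-stack values sketch (assumed setup, not the exact one in use)
prometheus:
  prometheusSpec:
    remoteWrite:
      # same endpoint as in the error log below
      - url: http://mimir-distributed-nginx.mimir.svc:80/api/v1/push
    # allow Prometheus to pick up PodMonitors created outside this Helm release,
    # such as Linkerd's linkerd-proxy PodMonitor
    podMonitorSelectorNilUsesHelmValues: false
```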
Logs, error output, etc
{
"caller": "dedupe.go:112",
"component": "remote",
"count": 2000,
"err": "server returned HTTP status 400 Bad Request: failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-05-14T22:16:17.61Z and is from series tcp_close_total{app_kubernetes_io_instance=\"kube-prometheus-stack-prometheus\", app_kubernetes_io_managed_by=\"prometheus-operator\", app_kubernetes_io_name=\"prometheus\", app_kubernetes_io_version=\"2.51.2\", apps_kubernetes_io_pod_index=\"0\", container=\"linkerd-proxy\", control_plane_ns=\"linkerd\", controller_revision_hash=\"prometheus-kube-prometheus-stack-prometheus-647889d8c\", direction=\"outbound\", dst_control_plane_ns=\"linkerd\", dst_daemonset=\"promtail\", dst_namespace=\"promtail\", dst_pod=\"promtail-f4kms\", dst_serviceaccount=\"promtail\", instance=\"10.2.25.220:4191\", job=\"linkerd/linkerd-proxy\", namespace=\"monitoring\", operator_prometheus_io_name=\"kube-prometheus-stack-prometheus\", operator_promethe",
"exemplarCount": 0,
"level": "error",
"msg": "non-recoverable error",
"remote_name": "2cbc3b",
"ts": "2024-05-14T22:16:19.070Z",
"url": "http://mimir-distributed-nginx.mimir.svc:80/api/v1/push"
}
Output of linkerd check -o short
❯ linkerd check -o short
linkerd-config
--------------
× control plane CustomResourceDefinitions exist
missing grpcroutes.gateway.networking.k8s.io
see https://linkerd.io/2/checks/#l5d-existence-crd for hints
linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
some proxies are not running the current version:
* jaeger-injector-7566699689-44tfd (stable-2.14.10)
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
jaeger-injector-7566699689-44tfd running stable-2.14.10 but cli running edge-24.5.2
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-7fd4bb899-5wczd (edge-24.5.1)
* metrics-api-7fd4bb899-srcxk (edge-24.5.1)
* tap-988849cc4-5drh4 (edge-24.5.1)
* tap-988849cc4-htdg5 (edge-24.5.1)
* tap-injector-84f85cb756-gglv7 (edge-24.5.1)
* tap-injector-84f85cb756-zhs2n (edge-24.5.1)
* web-5d484bb4f-xvzfs (edge-24.5.1)
* web-5d484bb4f-zmfbh (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-7fd4bb899-5wczd running edge-24.5.1 but cli running edge-24.5.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
‼ prometheus is installed and configured correctly
missing ClusterRoles: linkerd-linkerd-viz-prometheus
see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints
Status check results are ×
Environment
EKS 1.28
Possible solution
I honestly do not know enough about Prometheus metric relabeling to propose a fix, but I can say that of the 40+ ServiceMonitors we run, only this specific PodMonitor triggers these errors.
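As a stopgap (not a root-cause fix), something along these lines should quarantine just this PodMonitor's series at remote-write time so Mimir stops rejecting pushes, without disabling the scrape locally. This is a sketch against the prometheus-operator RemoteWriteSpec; the job value is copied from the rejected series in the log and may need adjusting:

```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://mimir-distributed-nginx.mimir.svc:80/api/v1/push
        writeRelabelConfigs:
          # drop series scraped via the linkerd-proxy PodMonitor before they
          # are pushed to Mimir
          - sourceLabels: [job]
            regex: linkerd/linkerd-proxy
            action: drop
```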
Additional context
No response
Would you like to work on fixing this bug?
no