PodMonitor linkerd-proxy - produces duplicate samples with the same timestamp
What is the issue?
When using something like Mimir for long-term metric retention, the PodMonitor's metrics are scraped by Prometheus and remote-written directly to Mimir. Mimir then rejects a large number of these metrics with the following error:
failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested
Mimir Docs: https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-sample-duplicate-timestamp
From the linked runbook: "Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample."
Disabling this PodMonitor stops the errors.
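To see which relabel rules are actually in play, these are the commands I would start from (a sketch only: the PodMonitor namespace/name are inferred from the `job="linkerd/linkerd-proxy"` label in the log below, and the secret name from kube-prometheus-stack defaults, so adjust to your install):

```sh
# Show the relabelings the linkerd-proxy PodMonitor declares
# (namespace/name inferred from the job label in the rejected series)
kubectl -n linkerd get podmonitor linkerd-proxy -o yaml

# Dump the scrape config prometheus-operator rendered from it, to compare the
# post-relabel label sets of two colliding targets
# (secret name follows the prometheus-operator convention prometheus-<Prometheus CR name>)
kubectl -n monitoring get secret prometheus-kube-prometheus-stack-prometheus \
  -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip \
  | grep -A40 'linkerd-proxy'
```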
How can it be reproduced?
- Install Prometheus (kube-prometheus-stack in our case)
- Install Mimir
- Install Linkerd with its linkerd-proxy PodMonitor enabled
- Configure Prometheus to remote-write to Mimir (a minimal sketch of the wiring follows this list)
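A minimal sketch of the remote-write wiring, assuming kube-prometheus-stack Helm values (which matches the labels in the log) and the Mimir nginx endpoint from the error below; everything else is left at defaults:

```yaml
# kube-prometheus-stack values sketch (assumed setup, not the exact one in use)
prometheus:
  prometheusSpec:
    remoteWrite:
      # same endpoint as in the error log below
      - url: http://mimir-distributed-nginx.mimir.svc:80/api/v1/push
    # allow Prometheus to pick up PodMonitors created outside this Helm release,
    # such as Linkerd's linkerd-proxy PodMonitor
    podMonitorSelectorNilUsesHelmValues: false
```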
Logs, error output, etc
{
"caller": "dedupe.go:112",
"component": "remote",
"count": 2000,
"err": "server returned HTTP status 400 Bad Request: failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-05-14T22:16:17.61Z and is from series tcp_close_total{app_kubernetes_io_instance=\"kube-prometheus-stack-prometheus\", app_kubernetes_io_managed_by=\"prometheus-operator\", app_kubernetes_io_name=\"prometheus\", app_kubernetes_io_version=\"2.51.2\", apps_kubernetes_io_pod_index=\"0\", container=\"linkerd-proxy\", control_plane_ns=\"linkerd\", controller_revision_hash=\"prometheus-kube-prometheus-stack-prometheus-647889d8c\", direction=\"outbound\", dst_control_plane_ns=\"linkerd\", dst_daemonset=\"promtail\", dst_namespace=\"promtail\", dst_pod=\"promtail-f4kms\", dst_serviceaccount=\"promtail\", instance=\"10.2.25.220:4191\", job=\"linkerd/linkerd-proxy\", namespace=\"monitoring\", operator_prometheus_io_name=\"kube-prometheus-stack-prometheus\", operator_promethe",
"exemplarCount": 0,
"level": "error",
"msg": "non-recoverable error",
"remote_name": "2cbc3b",
"ts": "2024-05-14T22:16:19.070Z",
"url": "http://mimir-distributed-nginx.mimir.svc:80/api/v1/push"
}
Output of linkerd check -o short
❯ linkerd check -o short
linkerd-config
--------------
× control plane CustomResourceDefinitions exist
missing grpcroutes.gateway.networking.k8s.io
see https://linkerd.io/2/checks/#l5d-existence-crd for hints
linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
some proxies are not running the current version:
* jaeger-injector-7566699689-44tfd (stable-2.14.10)
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
jaeger-injector-7566699689-44tfd running stable-2.14.10 but cli running edge-24.5.2
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-7fd4bb899-5wczd (edge-24.5.1)
* metrics-api-7fd4bb899-srcxk (edge-24.5.1)
* tap-988849cc4-5drh4 (edge-24.5.1)
* tap-988849cc4-htdg5 (edge-24.5.1)
* tap-injector-84f85cb756-gglv7 (edge-24.5.1)
* tap-injector-84f85cb756-zhs2n (edge-24.5.1)
* web-5d484bb4f-xvzfs (edge-24.5.1)
* web-5d484bb4f-zmfbh (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-7fd4bb899-5wczd running edge-24.5.1 but cli running edge-24.5.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
‼ prometheus is installed and configured correctly
missing ClusterRoles: linkerd-linkerd-viz-prometheus
see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints
Status check results are ×
Environment
EKS 1.28
Possible solution
I honestly do not know enough about Prometheus metric relabeling to propose a fix, but I can say that of the 40+ ServiceMonitors we run, only this specific PodMonitor triggers these errors.
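As a stopgap (not a root-cause fix), something along these lines should quarantine just this PodMonitor's series at remote-write time so Mimir stops rejecting pushes, without disabling the scrape locally. This is a sketch against the prometheus-operator RemoteWriteSpec; the job value is copied from the rejected series in the log and may need adjusting:

```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://mimir-distributed-nginx.mimir.svc:80/api/v1/push
        writeRelabelConfigs:
          # drop series scraped via the linkerd-proxy PodMonitor before they
          # are pushed to Mimir
          - sourceLabels: [job]
            regex: linkerd/linkerd-proxy
            action: drop
```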
Additional context
No response
Would you like to work on fixing this bug?
no