
Duplicate metric on the linkerd-proxy /metrics endpoint of linkerd-injected pods after upgrading from 2025.2.1 to 2025.3.2

Open mrtworo opened this issue 9 months ago • 5 comments

What is the issue?

Prometheus, while scraping linkerd-proxy metrics, logs "Error on ingesting samples with different value but same timestamp" for various linkerd-injected pods. Checking the /metrics endpoint of one of the affected targets shows that inbound_http_authz_allow_total is duplicated; logs attached.

The first warnings in the Prometheus logs appeared right after the upgrade from 2025.2.1 to 2025.3.2.

How can it be reproduced?

It appears to be triggered by the upgrade from 2025.2.1 to 2025.3.2: the first log entries with the problem appeared right after the new version of the chart was applied.

Pods whose linkerd-proxy container is from the previous version, i.e. cr.l5d.io/linkerd/proxy:edge-25.2.1, have the problem; when they are removed and recreated with cr.l5d.io/linkerd/proxy:edge-25.3.2, there are no duplicated metrics.
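
For anyone trying to identify which pods are still running the older proxy image, the sketch below may help (it assumes kubectl access; the image tag is the one from this report, and -A can be narrowed to a specific namespace):

# List pods whose linkerd-proxy container still runs the edge-25.2.1 image.
kubectl get pods -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="linkerd-proxy")].image}{"\n"}{end}' \
  | grep 'proxy:edge-25.2.1'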

Logs, error output, etc

Prometheus logs:

prometheus time=2025-03-18T13:30:12.599Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=podMonitor/linkerd/linkerd-proxy/0 target=http://10.11.71.179:4191/metrics num_dropped=1

Metrics endpoint:

inbound_http_authz_allow_total{target_addr="10.11.85.144:3055",target_ip="10.11.85.144",target_port="3055",srv_group="",srv_kind="default",srv_name="all-unauthenticated",route_group="",route_kind="default",route_name="default",authz_group="",authz_kind="default",authz_name="all-unauthenticated",tls="true",client_id="ingress-nginx.ingress-nginx-public.serviceaccount.identity.linkerd.cluster.local"} 5060
inbound_http_authz_allow_total{target_addr="10.11.85.144:3055",target_ip="10.11.85.144",target_port="3055",srv_group="",srv_kind="default",srv_name="all-unauthenticated",route_group="",route_kind="default",route_name="default",authz_group="",authz_kind="default",authz_name="all-unauthenticated",tls="true",client_id="ingress-nginx.ingress-nginx-public.serviceaccount.identity.linkerd.cluster.local"} 13459
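
A quick way to confirm the duplication is to strip the sample values and look for repeated label sets; a minimal sketch (assuming access to the proxy's admin port on 4191, e.g. via kubectl port-forward):

# Keep only metric name + label set (drop the value), then print any label set seen more than once.
curl -s http://localhost:4191/metrics \
  | grep '^inbound_http_authz_allow_total' \
  | awk '{print $1}' \
  | sort | uniq -d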

output of linkerd check -o short

n/a

Environment

Kubernetes Version: v1.32.0-eks-2e66e76
Cluster Environment: AWS
Host OS: Bottlerocket OS 1.34.0 (aws-k8s-1.32)
Linkerd version: edge 2025.3.2

Possible solution

As a workaround, if this is indeed related to the upgrade: restart the affected pods so they are re-injected with the newer proxy.
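
A sketch of that workaround (the deployment name and namespace are placeholders): rolling the workload recreates its pods, so they pick up the newer proxy at injection time.

# Recreate the pods so they are injected with the current proxy version.
kubectl rollout restart deployment/my-app -n my-namespace
# Wait for the rollout to complete before re-checking the metrics endpoint.
kubectl rollout status deployment/my-app -n my-namespace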

Additional context

No response

Would you like to work on fixing this bug?

None

mrtworo avatar Mar 18 '25 14:03 mrtworo

Do you happen to have the proxy_build_info metric for this pod?
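
(For anyone checking the same thing, one way to pull this metric is via the proxy's admin port; a sketch assuming port-forward access, with a placeholder pod name and namespace:)

# In one terminal: forward the proxy admin port.
kubectl port-forward pod/my-app-pod -n my-namespace 4191:4191
# In another: grab proxy_build_info from the metrics endpoint.
curl -s http://localhost:4191/metrics | grep '^proxy_build_info'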

olix0r avatar Mar 19 '25 01:03 olix0r

@olix0r sure, are you looking for the version? It was 2.280.0 for the affected pods. Additionally, I wanted to underline that this seems to be just a side effect of the upgrade and is trivial to resolve; however, since I didn't find anything about it in the release notes, I thought it would be prudent to report such unexpected behaviour in case there is something more to it :)

mrtworo avatar Mar 19 '25 08:03 mrtworo

hi there @mrtworo, thank you for filing this issue.

i've tried to reproduce this issue using this proxy version, but when i curl the proxy's metrics endpoint i do not see this metric duplicated:

; curl localhost:4191/metrics | grep inbound_http_authz_allow_total > inbound_http_authz_allow_total.txt
; wc -l inbound_http_authz_allow_total.txt
5 inbound_http_authz_allow_total.txt
; uniq inbound_http_authz_allow_total.txt | wc -l
5

i am relieved to hear that this was trivial to resolve, and appreciate you taking the time to file this bug report.

if i can ask, were these errors recurring consistently after upgrading, or were they specific to the time frame when you upgraded from 2025.2.1 to 2025.3.2?

cratelyn avatar Mar 19 '25 20:03 cratelyn

@cratelyn happy to help, the errors in the Prometheus logs started right after the resources generated by the new chart were applied in our cluster:

  • pods present during the chart upgrade, injected with 2025.2.1 and not restarted, began to expose the duplicated metrics at that time and kept doing so consistently until we noticed a couple of hours later and deleted them, so they were re-injected with 2025.3.2
  • pods that were restarted due to other activities and injected with the new proxy were fine

mrtworo avatar Mar 20 '25 10:03 mrtworo

Reporting the same thing here, except we went from "2025.2.3" to "2025.3.4".

time=2025-04-04T11:16:23.018Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=podMonitor/linkerd/linkerd-proxy/0 target=http://10.16.3.218:4191/metrics num_dropped=1
time=2025-04-04T11:16:33.018Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=podMonitor/linkerd/linkerd-proxy/0 target=http://10.16.3.218:4191/metrics num_dropped=1
time=2025-04-04T11:16:43.017Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=podMonitor/linkerd/linkerd-proxy/0 target=http://10.16.3.218:4191/metrics num_dropped=1
time=2025-04-04T11:16:53.018Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=podMonitor/linkerd/linkerd-proxy/0 target=http://10.16.3.218:4191/metrics num_dropped=1

jseiser avatar Apr 04 '25 11:04 jseiser

Seeing this after upgrading to 2025.4.4 from 2024.11.8.

ts=2025-06-26T04:29:05.868Z caller=scrape.go:1820 level=warn component="scrape manager" scrape_pool=linkerd-proxy target=http://10.1.3.151:4191/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

As suggested, re-launching the pods so they get injected with the new proxy version seems to fix it.
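
If it's useful to others, one way to verify that every injected pod is now running the new proxy is the linkerd CLI's proxy version report (a sketch; it assumes the linkerd CLI is installed and pointed at the right cluster):

# Reports the proxy versions running in the cluster alongside the client/server versions.
linkerd version --proxy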

cmartell-at-m42 avatar Jun 26 '25 04:06 cmartell-at-m42

after further investigation, i was able to reproduce this, and have identified a fix. more to come soon, i will follow up when a pull request is in review, and when an edge release including a fix is released. thank you all!

cratelyn avatar Jun 30 '25 14:06 cratelyn

this is fixed in linkerd/linkerd2-proxy#3987! this should be included in an edge release this afternoon.

cratelyn avatar Jul 02 '25 16:07 cratelyn

this is fixed in https://github.com/linkerd/linkerd2/releases/tag/edge-25.7.1.

it's worth pointing out that because of the edge release paradigm that Linkerd follows, this issue will persist when upgrading from versions prior to 2025.3.2.

this issue did, however, surface two issues with our metric labeling, as outlined in linkerd/linkerd2-proxy#3987. that fix will ensure that these duplicate metrics are not encountered again in the future. 🙂

cratelyn avatar Jul 02 '25 18:07 cratelyn

@cratelyn when you say "this issue will persist when upgrading from versions prior to 2025.3.2", is there anything an end user needs to do when upgrading to circumvent this issue? Or should it be taken care of automatically upon upgrading to edge 2025.7.1 or later?

alekhrycaiko avatar Oct 21 '25 19:10 alekhrycaiko