
linkerd-proxy sidecar prometheus metrics disappear

Open sdhoward opened this issue 4 years ago • 7 comments

Bug Report

I have linkerd 2.10.1 running in my v1.19.6-eks cluster. My API gateway pod proxies a request to my web pod. Both are running the linkerd-proxy sidecar.

After I issue the request to the web pod, the request shows up in the prometheus metrics on :4191 on the web pod.

$ kubectl -n web port-forward prod-web-56bff6ffdd-56xn4 4191:4191 &
$ curl -s http://127.0.0.1:4191/metrics | grep response_total | grep my.website
response_total{direction="inbound",authority="my.website",target_addr="10.0.79.211:8080",tls="no_identity",no_tls_reason="no_tls_from_remote",status_code="200",classification="success"} 1
route_response_total{direction="inbound",dst="my.website:80",status_code="200",classification="success"} 1

If I check the same prometheus metrics port 10 minutes later, I do not see those records any more; they are missing. There have been no recent restarts of the pod or the container.
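One way to pin down exactly which series vanish is to save the `/metrics` output twice and diff the label sets. The following is a minimal sketch, not part of linkerd; `parse_metrics` and `disappeared` are hypothetical helper names, and the parser handles only simple `name{labels} value` sample lines.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Map (metric name, sorted label pairs) -> value for each sample line."""
    series = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = tuple(sorted(LABEL_RE.findall(m.group("labels") or "")))
        series[(m.group("name"), labels)] = float(m.group("value"))
    return series


def disappeared(before_text, after_text):
    """Return the series present in the first scrape but absent from the second."""
    before = parse_metrics(before_text)
    after = parse_metrics(after_text)
    return sorted(key for key in before if key not in after)
```

To use it, redirect the `curl` output above into one file right after the request and into another file ten minutes later, then pass both files' contents to `disappeared`.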

This is contrary to how counters in Prometheus are supposed to work; the values should persist until the container is restarted.

sdhoward • Jun 03 '21 22:06

@sdhoward This behavior was added because these metrics include high-cardinality labels like IP addresses. For long-running proxies, especially in ingresses, these metrics can incur substantial memory overhead. I think we're open to changing the behavior, but it would need careful consideration to avoid the kinds of problems that have arisen in the past.
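The thread does not spell out how or when the proxy drops series, so the following is only a generic illustration of the trade-off, not linkerd's implementation. It is a minimal sketch of a counter registry that evicts label sets that have been idle too long; the class name, the `retain_idle` parameter, and the 600-second value in the usage note are all assumptions, the last chosen only to mirror the ten minutes mentioned in the report.

```python
class IdleEvictingCounters:
    """Counters keyed by label set; idle series are dropped to bound memory."""

    def __init__(self, retain_idle):
        self.retain_idle = retain_idle  # hypothetical setting, in seconds
        self._values = {}               # label tuple -> count
        self._last_update = {}          # label tuple -> time of last increment

    def incr(self, labels, now):
        """Increment the counter for this label set and mark it as active."""
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0) + 1
        self._last_update[key] = now

    def scrape(self, now):
        """Evict series idle for at least retain_idle, then return a snapshot."""
        idle = [k for k, t in self._last_update.items()
                if now - t >= self.retain_idle]
        for key in idle:
            del self._values[key]
            del self._last_update[key]
        return dict(self._values)
```

With `retain_idle=600`, a series incremented once at time 0 is still present in a scrape at time 60 but gone from a scrape at time 600, and a later request recreates it starting from 1. Memory stays bounded by the number of recently active label sets, at the cost of counters that no longer persist for the life of the process.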

Does this behavior cause problems for you in practice?

olix0r • Jun 04 '21 16:06

This is an issue in low-traffic pods because the values just get overwritten. By what means are the values erased? On some interval?
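To make the low-traffic problem concrete, here is a minimal sketch of what a consumer can reconstruct from scraped values alone. It assumes a simplified reset rule of the kind Prometheus-style tooling applies to counters (a drop in value is read as a restart from zero); it is not Prometheus's exact algorithm, and `observed_increase` is a hypothetical name.

```python
def observed_increase(samples):
    """Sum the increase seen across scrapes; None means the series was absent."""
    total = 0.0
    prev = None
    for value in samples:
        if value is None:
            continue  # series missing from this scrape: nothing to compare
        if prev is not None:
            if value >= prev:
                total += value - prev
            else:
                total += value  # value dropped: treat as a restart from zero
        prev = value
    return total


# One request, the series disappears, then one more request recreates it at 1.
low_traffic = [1, 1, None, None, 1, 1]
print(observed_increase(low_traffic))
```

Because the recreated series comes back at the same value it had before it vanished, no drop is visible and the second request contributes nothing to the computed increase. A busier series that returns at a lower value is at least recognizable as a reset.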

The labels are nice, and I see what you mean about high cardinality, but if I had the option to turn those labels off to save my data, I probably would.
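As a rough illustration of what turning labels off would mean for the data, the sketch below sums a counter over every label except the ones being dropped. It works on scraped text on the consumer side only, so it does not change what the proxy retains; `drop_labels` is a hypothetical helper, and which labels count as high-cardinality (here `target_addr`) is an assumption.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def drop_labels(text, metric, drop):
    """Sum `metric` over all series after removing the labels named in `drop`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m or m.group("name") != metric:
            continue  # comments, blank lines, and other metrics are ignored
        kept = tuple(sorted(
            (k, v) for k, v in LABEL_RE.findall(m.group("labels") or "")
            if k not in drop
        ))
        totals[kept] += float(m.group("value"))
    return dict(totals)
```

For example, `drop_labels(scrape_text, "response_total", {"target_addr"})` collapses series that differ only in target address into a single total per remaining label set.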

The lack of documentation is another issue: https://linkerd.io/2.10/reference/proxy-metrics/

Even a pushgateway couldn't help with this situation, because when you push a partial data set to a pushgateway, existing values can still get overwritten.

sdhoward • Jun 04 '21 16:06

https://github.com/linkerd/linkerd2/issues/5746 tracks the design work for being able to drop labels on the client side. @sdhoward would that satisfy your use case?

adleong • Jun 07 '21 20:06

Not really, because I want the data to persist with low cardinality. It looks like there's still no agreement on a setting for that.

sdhoward • Jun 14 '21 19:06

It seems like my issue would be addressed if we updated https://linkerd.io/2.10/reference/proxy-metrics/ to reflect:

  • whether response_total has a limitation where the metrics disappear (and describe under what condition)
  • whether route_response_total has a limitation where the metrics disappear (and describe under what condition)
  • whether control_response_total has a limitation where the metrics disappear (and describe under what condition)

It's just not documented, and I'm still in the dark about this. @olix0r, you say that this behavior was added, but you don't say what the behavior is.

sdhoward • Jun 18 '21 00:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] • Sep 21 '21 21:09

I still feel like the clarifications to the docs that I mentioned in my comment https://github.com/linkerd/linkerd2/issues/6222#issuecomment-863647718 would be valuable to have. @olix0r mentioned that this behavior was added for a reason, but didn't go into detail about what the behavior is.

sdhoward • Sep 30 '21 04:09