
linkerd-proxy does not evict idle metrics eventually leading to OOMKILL

Open tehlers320 opened this issue 1 year ago • 5 comments

What is the issue?

The longer linkerd runs (currently 2.14.8 for us), the more metric series it accumulates.

$ cat /tmp/3days  |wc -l
  134464
$ cat /tmp/94minutes  |wc -l
   38510

The breakdown by metric name shows it is mostly latency histogram buckets:

$ cat /tmp/3days  |cut -d \{ -f1 |sort | uniq -c |sort -nr |head -10
99190 response_latency_ms_bucket
5571 request_total
4906 response_total
4187 tcp_write_bytes_total
4187 tcp_read_bytes_total
4187 tcp_open_total
4187 tcp_open_connections
3815 response_latency_ms_sum
3815 response_latency_ms_count
 165 tcp_close_total

$ cat /tmp/94minutes  |cut -d \{ -f1 |sort | uniq -c |sort -nr |head -10
26130 response_latency_ms_bucket
2002 tcp_write_bytes_total
2002 tcp_read_bytes_total
2002 tcp_open_total
2002 tcp_open_connections
1102 response_total
1006 request_total
1005 response_latency_ms_sum
1005 response_latency_ms_count
  78 control_response_latency_ms_bucket

Apps here have been gone for well over 12 hours.

$ cat /tmp/3days  |grep some-app-name |tr ',' '\n' |grep dst_pod_template_hash |sort |uniq -c |sort
11881 dst_pod_template_hash="6f8f9c4fc7" (GONE)
12608 dst_pod_template_hash="d5d97648" (GONE)
20149 dst_pod_template_hash="57c9f79f9c" (GONE)
21998 dst_pod_template_hash="9468b5dbf" 
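
(For reference, the "GONE" markers are based on whether any pods with that hash still exist; the dst_pod_template_hash label should correspond to the pod-template-hash label Kubernetes puts on a Deployment's ReplicaSets and Pods, so a check along these lines, with a placeholder namespace, shows whether a hash is still live:)

$ kubectl get pods -n some-namespace -l pod-template-hash=6f8f9c4fc7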

How can it be reproduced?

Just start things up, make a query, and then wait 10 minutes.

Per https://github.com/linkerd/linkerd2-proxy/blob/release/v2.210.4/linkerd/app/src/env.rs#L238, I believe 10m is the default idle-retention window before metrics are cleared.

We do NOT set LINKERD2_PROXY_METRICS_RETAIN_IDLE anywhere, as we use the standard injector that comes with linkerd and it's not something that's allowed there anyway.
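
A rough sketch of how to watch the growth (namespace and deployment names are placeholders): count the exported series, wait past the 10-minute idle window, and count again; idle series should have been evicted, but in our case the count only grows.

$ linkerd diagnostics proxy-metrics -n some-namespace deploy/some-app | grep -c '^response_latency_ms_bucket'
$ sleep 660   # a bit longer than the 10m default idle window
$ linkerd diagnostics proxy-metrics -n some-namespace deploy/some-app | grep -c '^response_latency_ms_bucket'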

Logs, error output, etc

See above; there isn't really a log for the metrics system that I can find.

output of linkerd check -o short

linkerd2-cli-stable-2.14.10-darwin-arm64 -L linkerd2 check -o short
linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.8
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.14.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-79586f8584-7vmwl (stable-2.14.8)
	* linkerd-destination-79586f8584-bbsb2 (stable-2.14.8)
	* linkerd-destination-79586f8584-gvdt2 (stable-2.14.8)
	* linkerd-destination-79586f8584-lchh9 (stable-2.14.8)
	* linkerd-destination-79586f8584-qp74t (stable-2.14.8)
	* linkerd-identity-6f7cd5f778-72gv2 (stable-2.14.8)
	* linkerd-identity-6f7cd5f778-mfpt8 (stable-2.14.8)
	* linkerd-identity-6f7cd5f778-vznvl (stable-2.14.8)
	* linkerd-proxy-injector-688b546d57-6ztdv (stable-2.14.8)
	* linkerd-proxy-injector-688b546d57-twxsb (stable-2.14.8)
	* linkerd-proxy-injector-688b546d57-x4ttz (stable-2.14.8)
	* smi-adaptor-5c6d65c5df-c8sn8 (stable-2.14.8)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-79586f8584-7vmwl running stable-2.14.8 but cli running stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
    exec: "linkerd-smi": executable file not found in $PATH
    see https://linkerd.io/2.14/checks/#extensions for hints

Status check results are √

### Environment

- k8s 1.22,1.23,1.24,1.25
- linkerd 2.14.8, 2.10.x, and 2.8; we have seen this on all of them

### Possible solution

I am not particularly good at Rust, so I don't follow where this is breaking.

### Additional context

This appears to be what's not working: https://github.com/linkerd/linkerd2-proxy/blob/release/v2.210.4/linkerd/metrics/src/store.rs#L65

We've seen this through many versions over the years, and our solution until now has been simply to restart many of our deployments once in a while to clear the metrics. This is not a new bug.

### Would you like to work on fixing this bug?

no

edit: adding some keywords

metrics, retain, evict, retain_idle

tehlers320 avatar Jul 30 '24 15:07 tehlers320

Hi @tehlers320,

Our team is looking into this and we'll have an update for you today with our analysis and/or next steps.

deusxanima avatar Jul 31 '24 14:07 deusxanima

I'm unable to reproduce this on the latest edge:

:; k delete po -n feh feh-server-10000b10x-65c59865dd-w8plw      
pod "feh-server-10000b10x-65c59865dd-w8plw" deleted
:; linkerd diagnostics proxy-metrics -n feh deploy/feh-client-10000b10x| grep "feh-server-10000b10x-65c59865dd-w8plw" |wc -l
35
:; sleep $((60 *11)) && linkerd diagnostics proxy-metrics -n feh deploy/feh-client-10000b10x| grep "feh-server-10000b10x-65c59865dd-w8plw" |wc -l
0

That is, I have a client and a server; I delete the server, and after the 10-minute timeout elapses, the metrics for that pod are no longer exported.

This likely indicates that, in your environment, the proxy still views this pod as an active endpoint. It would be ideal to test a more recent version of Linkerd, as there have been a number of improvements to the load balancer's endpoint management, including new metrics that better describe the balancer's state:

# HELP outbound_http_balancer_p2c_endpoints The number of endpoints currently in the balancer.
# TYPE outbound_http_balancer_p2c_endpoints gauge
outbound_http_balancer_p2c_endpoints{parent_group="core",parent_kind="Service",parent_namespace="feh",parent_name="feh-10000b10x",parent_port="80",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="feh",backend_name="feh-10000b10x",backend_port="80",backend_section_name=""} 1
# HELP outbound_http_balancer_p2c_updates The total number of service discovery updates received by a balancer.
# TYPE outbound_http_balancer_p2c_updates counter
outbound_http_balancer_p2c_updates_total{op="Add",parent_group="core",parent_kind="Service",parent_namespace="feh",parent_name="feh-10000b10x",parent_port="80",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="feh",backend_name="feh-10000b10x",backend_port="80",backend_section_name=""} 1
outbound_http_balancer_p2c_updates_total{op="Remove",parent_group="core",parent_kind="Service",parent_namespace="feh",parent_name="feh-10000b10x",parent_port="80",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="feh",backend_name="feh-10000b10x",backend_port="80",backend_section_name=""} 1
outbound_http_balancer_p2c_updates_total{op="Reset",parent_group="core",parent_kind="Service",parent_namespace="feh",parent_name="feh-10000b10x",parent_port="80",parent_section_name="",backend_group="core",backend_kind="Service",backend_namespace="feh",backend_name="feh-10000b10x",backend_port="80",backend_section_name=""} 1

Additionally, newer proxies include log messages that indicate balancer state changes:

[     2.012962s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:service{ns=feh name=feh-10000b10x port=80}: linkerd_pool_p2c: Adding endpoint addr=10.42.0.235:8081
[   181.921277s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:service{ns=feh name=feh-10000b10x port=80}: linkerd_pool_p2c: Removing endpoint addr=10.42.0.235:8081
[   184.994483s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:service{ns=feh name=feh-10000b10x port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[   184.995438s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:rescue{client.addr=10.42.0.236:42882}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.43.56.114:80: route default.http: backend Service.feh.feh-10000b10x:80: Service.feh.feh-10000b10x:80: service in fail-fast error.sources=[route default.http: backend Service.feh.feh-10000b10x:80: Service.feh.feh-10000b10x:80: service in fail-fast, backend Service.feh.feh-10000b10x:80: Service.feh.feh-10000b10x:80: service in fail-fast, Service.feh.feh-10000b10x:80: service in fail-fast, service in fail-fast]
[   185.605194s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:service{ns=feh name=feh-10000b10x port=80}: linkerd_pool_p2c: Adding endpoint addr=10.42.0.237:8081
[   185.612542s]  INFO ThreadId(01) outbound:proxy{addr=10.43.56.114:80}:service{ns=feh name=feh-10000b10x port=80}: linkerd_proxy_balance_queue::worker: Available; exited failfast elapsed=3.620066278

We'd ideally want to confirm whether there are removal messages for IPs that are still in the metrics export.
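
A rough way to do that (pod and namespace names are placeholders; the sidecar container in injected pods is named linkerd-proxy) is to grep the client proxy's logs for the balancer's endpoint changes and compare those addresses against the stale series in the metrics export:

$ kubectl logs -n some-namespace some-client-pod -c linkerd-proxy | grep -E 'Adding endpoint|Removing endpoint'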

Without upgrading Linkerd, you may have luck increasing the log level with the config.linkerd.io/proxy-log-level annotation, but the newer visibility enhancements should make this problem much easier to diagnose.
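
For example, something like this (names and the log-level string are placeholders; patching the pod template triggers a rollout, and the new pods are re-injected with the new level):

$ kubectl -n some-namespace patch deploy some-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,warn"}}}}}'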

olix0r avatar Aug 01 '24 18:08 olix0r

Oh interesting... we don't see 500s, so I kind of assumed the endpoints were up to date.

I will dig further. Thank you for your analysis.

edit: I'm on vacation for 2 weeks, so I may not respond for a while. Sorry.

tehlers320 avatar Aug 02 '24 01:08 tehlers320

@olix0r I work with @tehlers320. I also opened the discussion https://github.com/linkerd/linkerd2/discussions/12924, where we listed the detailed steps and investigation we did, but we couldn't reach any conclusion. Could you please link this issue with that discussion and see whether you can help us pinpoint the root cause?

ra-grover avatar Aug 05 '24 13:08 ra-grover

@ra-grover Thank you for the additional details. I am also traveling at the moment, so my responses may be delayed.

olix0r avatar Aug 05 '24 13:08 olix0r

This was related to the injection setting on our Traefik instances, which was set to enabled instead of ingress mode. I still don't follow exactly what happened here, but somehow the balancer objects just don't show up when running in enabled mode coupled with Traefik.

I'm going to close this.

(screenshot attached)

tehlers320 avatar Sep 19 '24 18:09 tehlers320

Ah! This is exactly what 'ingress' mode is for :) In the default mode, Linkerd uses the destination IP as configured by Traefik. That is, Traefik does its own load balancing and dispatches requests to individual endpoints, and Linkerd simply forwards those requests. In 'ingress' mode, Linkerd ignores Traefik's endpoint selection and uses the request's Host (or :authority, in HTTP/2) header to do service discovery and load balancing.
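
For reference, ingress mode is enabled by setting the injection annotation to ingress on the controller's pod template; a sketch with placeholder names:

$ kubectl -n traefik patch deploy traefik --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"linkerd.io/inject":"ingress"}}}}}'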

olix0r avatar Sep 20 '24 15:09 olix0r