Linkerd Destination and Proxy Injector Intermittent OOMKilled
What is the issue?
I've noticed the linkerd destination and proxy injector control plane components restart every now and then due to an OOMKilled error.
I am running linkerd w/ the recommended production-level configs (e.g. 3 instances of each control plane component).
The destination and injector components have been assigned a 250Mi memory limit.
I notice that all three replicas of these components restart at about the same time (give or take a few minutes), exiting w/ the same OOMKilled error (exit code 137).
Here are some resource usage charts. The first one is linkerd destination's resource usage over the past month:

And this one shows the proxy injector's resource usage over the past month:

Why do these spikes occur? Perhaps these spikes are associated w/ the rollout of a lot of pods? But that doesn't explain some of the spikes, because I know for sure we didn't do any major rollout.
The linkerd identity component does not show the same behavior.
The cluster that linkerd is running on has several hundred pods running. Could linkerd be running into issues w/ handling that many pods? How many pods can linkerd handle w/ the production level configuration?
Thank you for the help.
How can it be reproduced?
N/A
Logs, error output, etc
This is what the pod state shows for all of the linkerd destination and injector replicas (the times vary by a few minutes):
State: Running
Started: Fri, 15 Apr 2022 05:41:27 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 04 Apr 2022 19:17:12 -0400
Finished: Fri, 15 Apr 2022 05:41:26 -0400
Output of linkerd check -o short:
Linkerd core checks
===================
kubernetes-version
------------------
× is running the minimum kubectl version
exec: "kubectl": executable file not found in $PATH
see https://linkerd.io/2.11/checks/#kubectl-version for hints
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:53Z
see https://linkerd.io/2.11/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:27Z
see https://linkerd.io/2.11/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
Status check results are ×
Linkerd extensions checks
=========================
linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
certificate will expire on 2022-06-08T17:09:25Z
see https://linkerd.io/2.11/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ linkerd-viz pods are injected
could not find proxy container for prometheus-797c7d558b-hrfqc pod
see https://linkerd.io/2.11/checks/#l5d-viz-pods-injection for hints
‼ viz extension proxies and cli versions match
prometheus-797c7d558b-hrfqc running but cli running stable-2.11.1
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes Version: 1.20.15-gke.2500
- Cluster Environment: GKE
- Host OS: cos_containerd
- Linkerd version: 2.11.1
Possible solution
N/A
Additional context
N/A
Would you like to work on fixing this bug?
No response
There are numerous reasons why these components may be getting OOM killed, and we'll be happy to try and narrow in on what is happening; we will need more information, though, in order to start that process.
To clarify, is it the app or proxy containers of the linkerd-destination and linkerd-proxy-injector components that are being OOM killed?
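One quick way to check (just a sketch, not an exhaustive diagnostic) is to print each container's last terminated reason for one of the affected pods; <destination-pod> below is a placeholder for an actual pod name:
# Prints each container's name and the reason its previous instance
# terminated (e.g. OOMKilled).
kubectl -n linkerd get pod <destination-pod> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'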
Why do these spikes occur? Perhaps these spikes are associated w/ the rollout of a lot of pods?
This is a question you'll have to answer as you'll have a better idea of what is going on in your cluster around these spikes. The rollout of a lot of pods does sound likely as they would all be injected and lead to new destination requests, but it's best to actually confirm this.
It would also be helpful if you are able to provide a minimal reproducible example of this behavior. I realize this may be something that is hard to do and only exhibited on a large cluster, but so far there isn't any actual behavior that we can start looking into more closely.
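If it helps, one rough way to correlate spikes with rollouts (a sketch using plain kubectl only) is to compare pod creation times against the spike window:
# Lists pods across all namespaces ordered by creation time; a burst of pods
# created just before a memory spike points at a rollout or churn event.
kubectl get pods -A --sort-by=.metadata.creationTimestamp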
I ran into this issue as well. In my case the app containers of linkerd-destination and linkerd-proxy-injector were being OOMKilled. The proxy injector's last log message before being killed was "waiting for caches to sync". I had to run linkerd upgrade --set-string destinationResources.memory.limit=500Mi,proxyInjectorResources.memory.limit=500Mi | kubectl apply -f - (changing the memory limit from 250Mi to 500Mi).
I suspect that the out of memory error was due to the number of namespaces/deployments that we have in this Kubernetes environment.
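For anyone applying the same workaround, a quick sanity check (assuming the default deployment and container names, destination and proxy-injector) is to read the limits back after the upgrade:
# Prints the memory limit currently set on each controller's app container.
kubectl -n linkerd get deploy linkerd-destination \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="destination")].resources.limits.memory}'
kubectl -n linkerd get deploy linkerd-proxy-injector \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="proxy-injector")].resources.limits.memory}'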
I'm currently experiencing the same issue. The root cause appears to be upgrading our AWS EKS cluster add-ons, specifically:
- coredns "v1.8.4-eksbuild.1" -> "v1.8.7-eksbuild.1"
- kube-proxy "v1.21.2-eksbuild.2" -> "v1.22.11-eksbuild.2"
- vpc-cni "v1.11.0-eksbuild.1" -> "v1.11.2-eksbuild.1"
I'm not sure which is the culprit as they were upgraded in tandem.
After some testing, it might be because we had 15 or so pods crash looping constantly (Falco). I'll leave it disabled over the weekend to confirm.
Following up: yes, that was the issue. linkerd-destination OOMs when there are too many other crash-looping pods.
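If you suspect the same cause on your cluster, one rough way to surface crash-looping pods is to sort by restart count:
# Sorts pods by the first container's restart count (highest at the bottom);
# persistent crash loops show up with large counts.
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'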
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
We have confirmed there are memory issues in the controllers when there's a lot of resource churn in the cluster. We've recently updated the linkerd-proxy-injector component with a change that will vastly improve its memory consumption, in edge-22.10.3, which will be included in stable-2.13.0. And we're currently working on related changes to linkerd-destination to improve this situation as well. Stay tuned :-)
@alpeb Do you happen to have any updates or an ETA on the fix? I see a lot of OOMKilled errors from Linkerd Destination.
We still have a lot of ongoing work for this issue, but there have been several changes recently that show promising improvement. As alpeb already mentioned, #9650 introduced the use of the k8s metadata API in the proxy and tap injectors, which allows them to track only the metadata about resources, not the resources themselves. This should result in a smaller cache size for those components. The same change is being worked on for the destination component and is tracked by #9986.
Additionally, and most promising right now, we recently merged a fix (#10013) that allows the destination's EndpointSlice tracking to properly clean up EndpointSlices belonging to deleted Pods. Before this fix, those EndpointSlices were not being removed from the EndpointSlice cache, which means that with enough Pod churn we could definitely see memory grow over time. This will be released in the next edge release and I'd recommend testing that out when it's available.
Finally, this issue has exposed the fact that our current metrics make it difficult to track these types of issues down. If #10013 fixes the issue then great, but if it does not, we need to be able to work with the currently available metrics and determine what data points would help track stuff like this down in the future.
@Teko012 for your next steps I'd recommend upgrading to the next edge release when it's available and see if the EndpointSlice fix is the culprit here. In the near future, we'll determine additional metrics that would be helpful for tracking this kind of issue down without so much back-and-forth.
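To judge whether memory still grows after the upgrade, one rough approach (assuming metrics-server is installed) is to watch per-container usage in the linkerd namespace over a few days:
# Shows current memory usage per container; a steadily climbing destination
# container would suggest the leak is still present.
kubectl -n linkerd top pod --containers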
Hi @alpeb, I am facing a similar issue with Linkerd version 2.12.3. In my case the destination deployment keeps going into CrashLoopBackOff, with the destination container in the destination pod showing an OOMKilled status.
Can you please let us know how we can overcome this issue?
@kleimkuhler @alpeb Does this mean that the previously mentioned #9986 is not a priority anymore? We also see similar crashes in linkerd-proxy, but haven't tried the edge release yet.
#9986 is still prioritized and will possibly be included in the next stable release, 2.13. However, we've recently merged 2 changes that should fix separate memory leaks in the destination controller: #10013 and, more recently, #10201. Both leaks could result in noticeable memory growth when there is enough Pod churn on a cluster.
Our edge release this week should include #10201 and I'd recommend trying it once it's released. We closed this issue because, with both of these fixes in place, we'd like to get an idea of whether there is still something to track down. If so, opening a new issue with a new description would be helpful.