Linkerd Destination and Proxy Injector Intermittent OOMKilled
What is the issue?
I've noticed the linkerd destination and proxy injector control plane components restart every now and then due to an OOMKilled error.
I am running linkerd w/ the recommended production-level configs (e.g. 3 instances of each control plane component).
The destination and injector components have been assigned a 250Mi memory limit.
I notice that all three replicas of these components restart at about the same time (give or take a few minutes), exiting w/ the same OOMKilled error (exit code 137).
Here are some resource usage charts. The first one is linkerd destination's resource usage over the past month:

And this one shows the proxy injector's resource usage over the past month:

Why do these spikes occur? Perhaps these spikes are associated w/ the rollout of a lot of pods? But that doesn't explain some of the spikes, because I know for sure we didn't do any major rollout.
The linkerd identity component does not show the same behavior.
The cluster that linkerd is running on has several hundred pods running. Could linkerd be running into issues w/ handling that many pods? How many pods can linkerd handle w/ the production level configuration?
Thank you for the help.
How can it be reproduced?
N/A
Logs, error output, etc
This is what the pod state shows for all of the linkerd destination and injector replicas (the times vary by a few minutes):
State: Running
Started: Fri, 15 Apr 2022 05:41:27 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 04 Apr 2022 19:17:12 -0400
Finished: Fri, 15 Apr 2022 05:41:26 -0400
Output of linkerd check -o short:
Linkerd core checks
===================
kubernetes-version
------------------
× is running the minimum kubectl version
exec: "kubectl": executable file not found in $PATH
see https://linkerd.io/2.11/checks/#kubectl-version for hints
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:53Z
see https://linkerd.io/2.11/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:27Z
see https://linkerd.io/2.11/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
Status check results are ×
Linkerd extensions checks
=========================
linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
certificate will expire on 2022-06-08T17:09:25Z
see https://linkerd.io/2.11/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ linkerd-viz pods are injected
could not find proxy container for prometheus-797c7d558b-hrfqc pod
see https://linkerd.io/2.11/checks/#l5d-viz-pods-injection for hints
‼ viz extension proxies and cli versions match
prometheus-797c7d558b-hrfqc running but cli running stable-2.11.1
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes Version: 1.20.15-gke.2500
- Cluster Environment: GKE
- Host OS: cos_containerd
- Linkerd version: 2.11.1
Possible solution
N/A
Additional context
N/A
Would you like to work on fixing this bug?
No response
There are numerous reasons why these components may be getting OOM killed, and we'll be happy to try and narrow in on what is happening; we will need more information, though, in order to start that process.
To clarify, is it the app or proxy containers of the linkerd-destination and linkerd-proxy-injector components that are being OOM killed?
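One quick way to check (just a sketch, not an exhaustive diagnostic) is to print each container's last terminated reason for one of the affected pods; <destination-pod> below is a placeholder for an actual pod name:
# Prints each container's name and the reason its previous instance
# terminated (e.g. OOMKilled).
kubectl -n linkerd get pod <destination-pod> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'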
Why do these spikes occur? Perhaps these spikes are associated w/ the rollout of a lot of pods?
This is a question you'll have to answer as you'll have a better idea of what is going on in your cluster around these spikes. The rollout of a lot of pods does sound likely as they would all be injected and lead to new destination requests, but it's best to actually confirm this.
It would also be helpful if you are able to provide a minimal reproducible example of this behavior. I realize this may be something that is hard to do and only exhibited on a large cluster, but so far there isn't any actual behavior that we can start looking into more closely.
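If it helps, one rough way to correlate spikes with rollouts (a sketch using plain kubectl only) is to compare pod creation times against the spike window:
# Lists pods across all namespaces ordered by creation time; a burst of pods
# created just before a memory spike points at a rollout or churn event.
kubectl get pods -A --sort-by=.metadata.creationTimestamp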
I ran into this issue as well. In my case the app containers of linkerd-destination and linkerd-proxy-injector were being OOMKilled. The proxy injector's last log message before being killed was "waiting for caches to sync". I had to run linkerd upgrade --set-string destinationResources.memory.limit=500Mi,proxyInjectorResources.memory.limit=500Mi | kubectl apply -f - (changing the memory limit from 250Mi to 500Mi).
I suspect that the out of memory error was due to the number of namespaces/deployments that we have in this Kubernetes environment.
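For anyone applying the same workaround, a quick sanity check (assuming the default deployment and container names, destination and proxy-injector) is to read the limits back after the upgrade:
# Prints the memory limit currently set on each controller's app container.
kubectl -n linkerd get deploy linkerd-destination \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="destination")].resources.limits.memory}'
kubectl -n linkerd get deploy linkerd-proxy-injector \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="proxy-injector")].resources.limits.memory}'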
I'm currently experiencing the same issue. The root cause appears to be upgrading our AWS EKS cluster add-ons, specifically:
- coredns "v1.8.4-eksbuild.1" -> "v1.8.7-eksbuild.1"
- kube-proxy "v1.21.2-eksbuild.2" -> "v1.22.11-eksbuild.2"
- vpc-cni "v1.11.0-eksbuild.1" -> "v1.11.2-eksbuild.1"
I'm not sure which is the culprit as they were upgraded in tandem.
After some testing, it might be because we had 15 or so pods crash looping constantly (Falco). I'll leave it disabled over the weekend to confirm.
Following up: yes, that was the issue. linkerd-destination OOMs when there are too many other crash-looping pods.
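If you suspect the same cause on your cluster, one rough way to surface crash-looping pods is to sort by restart count:
# Sorts pods by the first container's restart count (highest at the bottom);
# persistent crash loops show up with large counts.
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'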
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
We have confirmed there are memory issues in the controllers when there's a lot of resource churn in the cluster. We've recently updated the linkerd-proxy-injector component with a change that will vastly improve its memory consumption, in edge-22.10.3, which will be included in stable-2.13.0. And we're currently working on related changes to linkerd-destination to improve this situation as well. Stay tuned :-)
@alpeb Do you happen to have any updates or an ETA on the fix? I see a lot of OOMKilled errors from Linkerd Destination.
We still have a lot of ongoing work for this issue, but there have been several changes recently that show promising improvement. As alpeb already mentioned, #9650 introduced the use of the k8s metadata API in the proxy and tap injectors, which allows them to track only the metadata about resources, not the resources themselves. This should result in a smaller cache size for those components. The same change is being worked on for the destination component and is tracked by #9986.
Additionally, and most promising right now, we recently merged a fix (#10013) that allows the destination's EndpointSlice tracking to properly clean up EndpointSlices belonging to deleted Pods. Before this fix, those EndpointSlices were not being removed from the EndpointSlice cache, which means that with enough Pod churn we could definitely see memory grow over time. This will be released in the next edge release and I'd recommend testing that out when it's available.
Finally, this issue has exposed the fact that our current metrics make it difficult to track these types of issues down. If #10013 fixes the issue then great, but if it does not, we need to be able to work with the currently available metrics and determine what data points would help track stuff like this down in the future.
@Teko012 for your next steps I'd recommend upgrading to the next edge release when it's available and see if the EndpointSlice fix is the culprit here. In the near future, we'll determine additional metrics that would be helpful for tracking this kind of issue down without so much back-and-forth.
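To judge whether memory still grows after the upgrade, one rough approach (assuming metrics-server is installed) is to watch per-container usage in the linkerd namespace over a few days:
# Shows current memory usage per container; a steadily climbing destination
# container would suggest the leak is still present.
kubectl -n linkerd top pod --containers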
Hi @alpeb, I am facing a similar issue with Linkerd version 2.12.3. In my case the destination deployment keeps going into CrashLoopBackOff, with the destination container in the destination pod showing an OOMKilled status.
Can you please let us know how we can overcome this issue?
@kleimkuhler @alpeb Does this mean that the previously mentioned #9986 is not a priority anymore? We also see similar crashes in linkerd-proxy, but haven't tried the edge release yet.
#9986 is still prioritized and will possibly be included in the next stable release, 2.13. However, we've recently merged 2 changes that should fix separate memory leaks in the destination controller: #10013 and, more recently, #10201. Both leaks could result in noticeable memory growth when there is enough Pod churn on a cluster.
Our edge release this week should include #10201 and I'd recommend trying it once it's released. We closed this issue because, with both of these fixes in place, we'd like to get an idea of whether there is still something to track down. If so, opening a new issue with a new description would be helpful.