`linkerd-destination` OOMKilled due to discovery spike in linkerd P2P multicluster, renders cluster inoperable
What is the issue?
As requested by Flynn on Slack.
Setup: running edge-24.3.2, 2 clusters, mirroring some services.
- Cluster A shares some mirrored services with Cluster B.
- Cluster B had some bad config rolled out to it, related to the Kubernetes 1.29 / proxy native sidecar changes (see the sketch after this list).
- Cluster B started creating thousands upon thousands (literally 10,000+) of linkerd-proxy-injector pods, all in the same ReplicaSet and all failing with the same error (I forget the exact message, but it was words to the effect of NonDefaultRestartPolicy, so it was clearly related to that change we'd made).
- This massively spiked resource usage in the linkerd-destination pods, which were OOMKilled continuously, taking down the meshed services in the cluster.
- This also spiked resource usage in (and caused OOMKills of) all of the linkerd-destination pods in cluster A.
- This caused a denial of service for all services in cluster A.
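As I understand it, the native-sidecar change means the injected proxy runs as an init container with restartPolicy: Always. A minimal sketch of that pod shape using the upstream client-go/core types, purely for context (the image tags here are illustrative, not our actual config):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Native sidecars (Kubernetes 1.28+): the proxy is an init container
	// whose restartPolicy is Always, rather than a regular container.
	always := corev1.ContainerRestartPolicyAlways
	spec := corev1.PodSpec{
		InitContainers: []corev1.Container{{
			Name:          "linkerd-proxy",
			Image:         "cr.l5d.io/linkerd/proxy:edge-24.3.2", // illustrative tag
			RestartPolicy: &always,
		}},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // placeholder application image
		}},
	}
	fmt.Printf("init container %q restartPolicy=%s\n",
		spec.InitContainers[0].Name, *spec.InitContainers[0].RestartPolicy)
}
```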
How can it be reproduced?
Take two clusters (A and B) that have pod-to-pod multicluster set up, with at least one service mirrored from A to B. The Linkerd deployments will need reasonable resource limits to exhibit the OOMKill and DoS effect.
On cluster A, scale a linkerd-injected deployment to something unreasonable, like 50,000 replicas (a client-go sketch of this step is below).
Cluster B should then attempt discovery of the new endpoints, causing a spike in resource usage across the Linkerd control plane in cluster B, especially the linkerd-destination pods.
If the linkerd-destination resource limits are exceeded, this results in a failure of the control plane in cluster B, stopping all meshed traffic.
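The scaling step is just `kubectl scale --replicas=50000` on any meshed deployment; as a self-contained client-go sketch (the emojivoto namespace/deployment names are placeholders for whatever workload you use):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder workload; any linkerd-injected deployment mirrored to
	// cluster B will do.
	const ns, deploy = "emojivoto", "web"

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	scale, err := cs.AppsV1().Deployments(ns).GetScale(ctx, deploy, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = 50000 // deliberately unreasonable, per the repro
	if _, err := cs.AppsV1().Deployments(ns).UpdateScale(ctx, deploy, scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("scaled", ns+"/"+deploy, "to 50000 replicas; watch linkerd-destination memory in cluster B")
}
```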
Logs, error output, etc
This is being written retrospectively, so I do not have the output of the destination pods; however, they were also being OOMKilled continuously until the number of pods in cluster A dropped back to normal levels.
output of linkerd check -o short
Again, this is historic and we have since upgraded from edge-24.3.2 to edge-24.5.1, but nothing else has changed in our setup.
linkerd-version
---------------
‼ cli is up-to-date
is running version 24.3.2 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.5.1 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-6cfb9689f6-7mj9t (edge-24.5.1)
* linkerd-destination-6cfb9689f6-mnvnz (edge-24.5.1)
* linkerd-destination-6cfb9689f6-n5w6l (edge-24.5.1)
* linkerd-identity-85c5896467-7v82j (edge-24.5.1)
* linkerd-identity-85c5896467-n6znn (edge-24.5.1)
* linkerd-identity-85c5896467-r7qgd (edge-24.5.1)
* linkerd-proxy-injector-589b5cc587-8pz5g (edge-24.5.1)
* linkerd-proxy-injector-589b5cc587-8w96c (edge-24.5.1)
* linkerd-proxy-injector-589b5cc587-bjh9l (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-6cfb9689f6-7mj9t running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
some proxies are not running the current version:
* collector-6c98b7c975-w5lmd (edge-24.5.1)
* jaeger-7f489d75f7-nqxzv (edge-24.5.1)
* jaeger-injector-567d6756dc-s8lrx (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
collector-6c98b7c975-w5lmd running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-548778dd4c-9z6tf (edge-24.5.1)
* metrics-api-548778dd4c-hvn88 (edge-24.5.1)
* metrics-api-548778dd4c-ltxbm (edge-24.5.1)
* tap-5f846bb67b-bprgk (edge-24.5.1)
* tap-5f846bb67b-cjngm (edge-24.5.1)
* tap-5f846bb67b-qkmxl (edge-24.5.1)
* tap-injector-58db76686f-jdb6b (edge-24.5.1)
* tap-injector-58db76686f-kdwp4 (edge-24.5.1)
* tap-injector-58db76686f-sqv5t (edge-24.5.1)
* web-6f486c9d84-5gfqs (edge-24.5.1)
* web-6f486c9d84-c6p9d (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-548778dd4c-9z6tf running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes v1.29.3
- EKS cluster
- Bottlerocket nodes
- Cilium CNI in AWS VPC replacement mode
Possible solution
Very much spitballing here, but one option could be a try/fail mechanism where the destination controller stops indexing individual endpoints and falls back to a "service mode" if discovery exceeds some amount of resource usage (this already sounds horribly like a JVM heap argument, so take it with a pinch of salt).
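Roughly the kind of guard I mean, as a standalone Go sketch; the names, the threshold, and the whole mechanism are hypothetical rather than anything that exists in the destination controller today:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// endpointIndexingEnabled is a hypothetical switch: when it is false, the
// controller would stop indexing individual endpoints and answer discovery
// in a coarser "service mode" instead.
var endpointIndexingEnabled atomic.Bool

const heapBudgetBytes = 512 << 20 // 512 MiB, illustrative only

// checkDiscoveryBudget sheds load when the heap grows past the budget,
// rather than letting the pod run into its memory limit and be OOMKilled.
func checkDiscoveryBudget() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapInuse > heapBudgetBytes {
		endpointIndexingEnabled.Store(false)
		fmt.Printf("heap in use %d B exceeds budget, degrading to service mode\n", m.HeapInuse)
		return
	}
	endpointIndexingEnabled.Store(true)
}

func main() {
	checkDiscoveryBudget()
	fmt.Println("endpoint indexing enabled:", endpointIndexingEnabled.Load())
}
```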
Alternatively, the controller could compare a discovery spike against the typical number of discovered pods and ramp its index up more slowly, along the lines of the sketch below.
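Again purely illustrative, not an existing mechanism: a smoothed baseline of observed endpoint counts, with growth above that baseline admitted only gradually:

```go
package main

import "fmt"

// spikeGuard tracks a smoothed baseline of discovered endpoint counts and
// caps how quickly growth above that baseline is admitted for indexing.
// All names and numbers are illustrative, not existing Linkerd config.
type spikeGuard struct {
	baseline  float64 // EWMA of recently admitted endpoint counts
	alpha     float64 // smoothing factor
	maxGrowth float64 // max multiple of baseline admitted per update
}

// admit returns how many endpoints to actually index this round.
func (g *spikeGuard) admit(observed int) int {
	if g.baseline == 0 {
		g.baseline = float64(observed)
		return observed
	}
	admitted := float64(observed)
	if limit := g.baseline * g.maxGrowth; admitted > limit {
		admitted = limit // ramp up slowly instead of indexing the whole spike at once
	}
	g.baseline = g.alpha*admitted + (1-g.alpha)*g.baseline
	return int(admitted)
}

func main() {
	g := &spikeGuard{alpha: 0.2, maxGrowth: 2.0}
	for _, observed := range []int{100, 120, 50000, 50000, 50000} {
		fmt.Println("observed:", observed, "indexed:", g.admit(observed))
	}
}
```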
Sharding the destination service could also mitigate this, by breaking up the resources that each pod tries to index... but I'm not sure how reasonable that is as an approach, since the point of HA is that each replica holds all of the state.
Additional context
No response
Would you like to work on fixing this bug?
None