linkerd2

`linkerd-destination` OOMKilled due to discovery spike in linkerd P2P multicluster, renders cluster inoperable

Open Sierra1011 opened this issue 9 months ago • 6 comments

What is the issue?

As requested by Flynn on Slack.

Setup: running edge-24.3.2, 2 clusters, mirroring some services.

  • Cluster A shares some mirrored services with Cluster B.
  • Cluster B had some bad config rolled out to it, relating to the Kube 1.29/proxy native sidecar changes
  • Cluster B started creating thousands upon thousands (literally, like 10k+) of linkerd-proxy-injector pods, all in the same ReplicaSet, all with the same error (I forget the exact error but it was words to the effect of NonDefaultRestartPolicy, so it was clearly related to that change we'd made)
  • This massively spiked resource usage in the linkerd-destination pods, which were OOMKilled continuously, taking down the services in the cluster.
  • This also spiked the resources of (and caused OOMKills in) all of the linkerd-destination pods in cluster A.
  • This caused a denial of service on all services in cluster A.

How can it be reproduced?

Take 2 clusters (A and B) that have pod-to-pod multicluster set up, with at least one service mirrored from A to B. The Linkerd deployment will need reasonable resource limits to exhibit the OOMKill and DoS effect.

  • On cluster A, scale a linkerd-injected deployment to something unreasonable, like 50,000 replicas.
  • This should cause cluster B to attempt discovery of the endpoints, spiking resource usage in the Linkerd control plane in cluster B, especially the linkerd-destination pods.
  • If the linkerd-destination resource limits are exceeded, this results in a failure of the control plane in cluster B, stopping all meshed traffic.
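If it helps to script the scaling step, here is a minimal sketch using client-go; the kubeconfig path, namespace, and `sample-app` deployment name are placeholders for whatever meshed, mirrored workload exists in cluster A.

```go
// Hypothetical reproduction helper: scale a meshed Deployment in cluster A to an
// extreme replica count so the remote destination controller in cluster B has to
// discover tens of thousands of endpoints. All flag defaults are placeholders.
package main

import (
	"context"
	"flag"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "kubeconfig for cluster A")
	namespace := flag.String("namespace", "default", "namespace of a meshed, mirrored deployment")
	name := flag.String("deployment", "sample-app", "meshed deployment to scale")
	replicas := flag.Int("replicas", 50000, "deliberately unreasonable replica count")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current scale subresource and bump it to the target count.
	scale, err := client.AppsV1().Deployments(*namespace).GetScale(context.Background(), *name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = int32(*replicas)
	if _, err := client.AppsV1().Deployments(*namespace).UpdateScale(context.Background(), *name, scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("scaled %s/%s to %d replicas; watch linkerd-destination memory in cluster B", *namespace, *name, *replicas)
}
```

While this runs, `kubectl top pod -n linkerd` against cluster B should show the linkerd-destination pods' memory climbing towards their limits.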

Logs, error output, etc

This is being written retrospectively, so I do not have output from the destination pods; however, they were also being OOMKilled continuously until the number of pods in cluster A returned to normal levels.

output of linkerd check -o short

Again, this is historical; we have since upgraded from edge-24.3.2 to edge-24.5.1, but nothing else has changed in our setup.

linkerd-version
---------------
‼ cli is up-to-date
    is running version 24.3.2 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-6cfb9689f6-7mj9t (edge-24.5.1)
	* linkerd-destination-6cfb9689f6-mnvnz (edge-24.5.1)
	* linkerd-destination-6cfb9689f6-n5w6l (edge-24.5.1)
	* linkerd-identity-85c5896467-7v82j (edge-24.5.1)
	* linkerd-identity-85c5896467-n6znn (edge-24.5.1)
	* linkerd-identity-85c5896467-r7qgd (edge-24.5.1)
	* linkerd-proxy-injector-589b5cc587-8pz5g (edge-24.5.1)
	* linkerd-proxy-injector-589b5cc587-8w96c (edge-24.5.1)
	* linkerd-proxy-injector-589b5cc587-bjh9l (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6cfb9689f6-7mj9t running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
	* collector-6c98b7c975-w5lmd (edge-24.5.1)
	* jaeger-7f489d75f7-nqxzv (edge-24.5.1)
	* jaeger-injector-567d6756dc-s8lrx (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
    collector-6c98b7c975-w5lmd running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-548778dd4c-9z6tf (edge-24.5.1)
	* metrics-api-548778dd4c-hvn88 (edge-24.5.1)
	* metrics-api-548778dd4c-ltxbm (edge-24.5.1)
	* tap-5f846bb67b-bprgk (edge-24.5.1)
	* tap-5f846bb67b-cjngm (edge-24.5.1)
	* tap-5f846bb67b-qkmxl (edge-24.5.1)
	* tap-injector-58db76686f-jdb6b (edge-24.5.1)
	* tap-injector-58db76686f-kdwp4 (edge-24.5.1)
	* tap-injector-58db76686f-sqv5t (edge-24.5.1)
	* web-6f486c9d84-5gfqs (edge-24.5.1)
	* web-6f486c9d84-c6p9d (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-548778dd4c-9z6tf running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

  • Kubernetes v1.29.3
  • EKS cluster
  • Bottlerocket nodes
  • Cilium CNI in AWS VPC replacement mode

Possible solution

Very much spitballing here, but one option could be a fallback where, if discovery exceeds some resource-usage budget, the destination controller stops indexing individual endpoints and operates in a "service mode" instead (this already sounds horribly like a JVM heap argument, so take it with a pinch of salt).
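To make that concrete, here is a purely illustrative Go sketch (not an existing Linkerd API; every name here is made up): a watchdog samples the process's heap usage and flips a flag once it crosses a soft budget, and the endpoint-indexing path could consult the flag to fall back to coarser, service-level resolution until usage drops again.

```go
// Hypothetical sketch of a memory watchdog for the destination controller.
package watchdog

import (
	"runtime"
	"sync/atomic"
	"time"
)

// Watchdog periodically compares live heap bytes against a soft budget.
type Watchdog struct {
	budgetBytes uint64
	degraded    atomic.Bool
}

func New(budgetBytes uint64) *Watchdog {
	return &Watchdog{budgetBytes: budgetBytes}
}

// Degraded reports whether indexing should fall back to service-level resolution.
func (w *Watchdog) Degraded() bool { return w.degraded.Load() }

// Run samples heap usage every interval until stop is closed.
func (w *Watchdog) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var stats runtime.MemStats
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			runtime.ReadMemStats(&stats)
			// Hysteresis: degrade above the budget, recover below 80% of it.
			if stats.HeapAlloc > w.budgetBytes {
				w.degraded.Store(true)
			} else if stats.HeapAlloc < w.budgetBytes*8/10 {
				w.degraded.Store(false)
			}
		}
	}
}
```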

Alternatively, the controller could detect a spike relative to the typical number of discovered pods and ramp up indexing more slowly.
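As an equally hypothetical illustration of that shape, newly discovered endpoints could be admitted to the index through a token bucket, so a jump from hundreds to tens of thousands of pods is absorbed gradually rather than all at once; the limiter parameters and the `index` callback below are placeholders.

```go
// Hypothetical sketch: smooth sudden endpoint spikes with a token-bucket limiter.
package smoothing

import (
	"context"

	"golang.org/x/time/rate"
)

// Smoother drips endpoint updates into the index at a bounded rate.
type Smoother struct {
	limiter *rate.Limiter
	index   func(addr string)
}

// NewSmoother allows perSecond new endpoints per second, with a burst sized for
// normal churn; anything beyond that queues up instead of spiking memory.
func NewSmoother(perSecond float64, burst int, index func(addr string)) *Smoother {
	return &Smoother{limiter: rate.NewLimiter(rate.Limit(perSecond), burst), index: index}
}

// Admit blocks until the limiter allows another endpoint to be indexed.
func (s *Smoother) Admit(ctx context.Context, addr string) error {
	if err := s.limiter.Wait(ctx); err != nil {
		return err
	}
	s.index(addr)
	return nil
}
```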

Sharding the destination service could also mitigate this, by breaking up the resources that each pod tries to index... but I'm not sure how reasonable that is as an approach, as the point of HA is that each pod holds all state.
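For completeness, a rough sketch of what that sharding might look like (again hypothetical; Linkerd does not shard the destination service today): each destination replica could claim ownership of a subset of services by hashing their namespace/name.

```go
// Hypothetical sketch: assign each Service to one of N destination replicas by
// hashing its namespace/name, so no single replica indexes every endpoint.
package sharding

import "hash/fnv"

// Owns reports whether the replica with index `replica` (0..replicas-1) is
// responsible for indexing the given service.
func Owns(namespace, name string, replica, replicas int) bool {
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name))
	return int(h.Sum32())%replicas == replica
}
```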

Additional context

No response

Would you like to work on fixing this bug?

None

Sierra1011 · May 16 '24 09:05