
Some pods have more rps than others

Open · 1ovsss opened this issue on May 30, 2025

What is the issue?

It can be seen from linkerd-viz that some pods have no rps at all, while others get all the traffic.

[screenshot: linkerd-viz showing uneven per-pod RPS]

How can it be reproduced?

Deploy two apps that communicate over gRPC (addressed via the FQDN ending in .cluster.local). Both apps are meshed.
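
For example, per-pod traffic can be compared with linkerd viz to see the skew directly (a minimal sketch; the namespace and workload names are placeholders):

    # Compare RPS across the individual server pods.
    linkerd viz stat pods -n <namespace>

    # Or look only at traffic flowing from the client deployment to the server deployment.
    linkerd viz stat deploy/<server-app> -n <namespace> --from deploy/<client-app>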

Logs, error output, etc

There are no visible errors. When load increases, pods start to crash because not all of them are actually handling the traffic.
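
If needed, the client-side proxy logs can be checked as well, even when the application logs show nothing (a minimal sketch; the deployment name is a placeholder):

    # Tail the Linkerd sidecar logs of the client workload for warnings or errors.
    kubectl logs -n <namespace> deploy/<client-app> -c linkerd-proxy --tail=100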

output of linkerd check -o short

╰> linkerd check -o short                                                                                                                                                             [10:16:57]
linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.4.1 but the latest edge version is 25.5.5
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.4.1 but the latest edge version is 25.5.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-68f7bd57cb-csvqn (edge-25.4.1)
	* linkerd-destination-68f7bd57cb-svt8m (edge-25.4.1)
	* linkerd-identity-6f6d4d4f64-6d468 (edge-25.4.1)
	* linkerd-identity-6f6d4d4f64-vst8j (edge-25.4.1)
	* linkerd-proxy-injector-858587c6ff-b87hs (edge-25.4.1)
	* linkerd-proxy-injector-858587c6ff-h4qk6 (edge-25.4.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-6b6994d46-8jbdc (edge-25.4.1)
	* prometheus-576d6c98cf-527nh (edge-25.4.1)
	* tap-574f8fb84f-2tl8n (edge-25.4.1)
	* tap-574f8fb84f-5hbzg (edge-25.4.1)
	* tap-574f8fb84f-gnht6 (edge-25.4.1)
	* tap-injector-6c9d7895dd-6vl8v (edge-25.4.1)
	* web-6b676dcf7-v9kxs (edge-25.4.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

  • k8s: 1.30.1 (also 1.31.2)
  • AKS
  • managed zonal cluster
  • OS: ubuntu
  • linkerd version: edge-25.4.1
  • CNI: cilium 1.12.9 (also 1.15.10)

Possible solution

If we remove the most loaded pod(s), the other pod(s) start to receive all the requests.

Additional context

We are also using the topology-mode: auto annotation on Services, but it can be seen that only a few pods in the same zone receive requests while the others look idle.
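
For reference, the annotation in question is presumably Kubernetes topology-aware routing, which on recent clusters is enabled per Service roughly like this (a sketch; the Service name is a placeholder):

    # Kubernetes 1.27+ key for topology-aware routing.
    kubectl annotate service <service-name> service.kubernetes.io/topology-mode=Auto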

Would you like to work on fixing this bug?

maybe

1ovsss · May 30 '25 07:05

In addition, we've tried to reproduce this on a Calico CNI cluster and Linkerd distributes requests well there. We tried to set the following Cilium params, with no effect (one way to apply them is sketched after the list):

    bpf-lb-sock-hostns-only: 'true'
    cni-exclusive: 'false'
    enable-l7-proxy: 'false'
    enable-session-affinity: 'false'
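
A sketch of one way such keys can be applied, assuming Cilium runs in kube-system under its default ConfigMap and DaemonSet names:

    # Merge the settings into the Cilium agent configuration.
    kubectl -n kube-system patch configmap cilium-config --type merge \
      -p '{"data":{"bpf-lb-sock-hostns-only":"true","enable-session-affinity":"false"}}'

    # Restart the agents so the new settings take effect.
    kubectl -n kube-system rollout restart daemonset/cilium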

Installing linkerd-cni didn't help either.

One more strange behavior: linkerd viz returns no data for the cilium cluster:

> kubectl config use-context k8s-calico
> linkerd viz stat svc/pl-test-victim 
NAME             MESHED   SUCCESS        RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
pl-test-victim        -   100.00%   696.0rps          56ms          96ms         100ms          6

> kubectl config use-context k8s-cilium
> linkerd viz stat svc/pl-test-victim
No traffic found.

Linkerd is installed from the same values.yaml in both clusters, and pl-test-victim is identical in both as well.

1ovsss · Jun 04 '25 17:06

@1ovsss It sounds to me that Linkerd is not doing load balancing... this would explain why the cilium cluster has no metrics associated with the Service -- the proxy is only managing endpoint-level forwarding as decided by cilium. The output of linkerd diagnostics proxy-metrics on such a client pod would confirm it -- there will be no load balancer metrics for the service.
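
As a concrete starting point, something along these lines would show whether the client proxy holds any outbound state for the Service (a sketch; the pod name is a placeholder and exact metric names vary by proxy version):

    # Dump the client pod's proxy metrics and look for entries that reference the Service.
    linkerd diagnostics proxy-metrics -n <namespace> po/<client-pod> | grep pl-test-victim

    # No matches would suggest the proxy never saw the Service's ClusterIP,
    # i.e. the CNI rewrote it to a pod IP before the proxy could balance.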

Your cilium settings look right to me, but I'm not a Cilium expert... you'll want to confirm that Cilium is allowing traffic to target the Service Cluster IP.
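
One way to check this from the Cilium side (a sketch; depending on the Cilium version the in-agent CLI may be called cilium or cilium-dbg, and exec may need an explicit container name):

    # See whether kube-proxy replacement / socket-level load balancing is active.
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -i KubeProxyReplacement

    # List the Service translations Cilium performs, including the ClusterIP in question.
    kubectl -n kube-system exec ds/cilium -- cilium service list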

olix0r · Aug 19 '25 14:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Nov 19 '25 04:11