Some pods receive far more RPS than others
What is the issue?
linkerd-viz shows that some pods receive no RPS at all, while others get all the traffic.
How can it be reproduced?
Deploy two apps that communicate over gRPC (using the Service FQDN ending in .cluster.local). Both apps are meshed.
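Roughly, the setup looks like this (a sketch with placeholder namespace, client name, and images; our real apps differ, but the shape is the same):

# All names and images below are placeholders; adjust to your environment.
kubectl create ns pl-test
kubectl annotate ns pl-test linkerd.io/inject=enabled        # mesh every pod in the namespace

# gRPC server behind a ClusterIP Service, several replicas
kubectl -n pl-test create deployment pl-test-victim --image=example.registry/grpc-server:latest --replicas=4
kubectl -n pl-test expose deployment pl-test-victim --port=50051

# gRPC client that dials the Service by FQDN,
# e.g. pl-test-victim.pl-test.svc.cluster.local:50051
kubectl -n pl-test create deployment pl-test-client --image=example.registry/grpc-client:latest

# Watch per-pod traffic; with this bug, some pl-test-victim pods stay at 0 rps
linkerd viz stat pods -n pl-test

Because gRPC keeps long-lived HTTP/2 connections, we rely on the Linkerd proxy to balance individual requests across the server pods.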
Logs, error output, etc
There are no visible errors. Under higher load, pods start to crash because not all of them are actually handling traffic.
output of linkerd check -o short
> linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.4.1 but the latest edge version is 25.5.5
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.4.1 but the latest edge version is 25.5.5
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-68f7bd57cb-csvqn (edge-25.4.1)
* linkerd-destination-68f7bd57cb-svt8m (edge-25.4.1)
* linkerd-identity-6f6d4d4f64-6d468 (edge-25.4.1)
* linkerd-identity-6f6d4d4f64-vst8j (edge-25.4.1)
* linkerd-proxy-injector-858587c6ff-b87hs (edge-25.4.1)
* linkerd-proxy-injector-858587c6ff-h4qk6 (edge-25.4.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-6b6994d46-8jbdc (edge-25.4.1)
* prometheus-576d6c98cf-527nh (edge-25.4.1)
* tap-574f8fb84f-2tl8n (edge-25.4.1)
* tap-574f8fb84f-5hbzg (edge-25.4.1)
* tap-574f8fb84f-gnht6 (edge-25.4.1)
* tap-injector-6c9d7895dd-6vl8v (edge-25.4.1)
* web-6b676dcf7-v9kxs (edge-25.4.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
Status check results are √
Environment
- k8s: 1.30.1 (also 1.31.2)
- AKS
- managed zonal cluster
- OS: Ubuntu
- linkerd version: edge-25.4.1
- CNI: cilium 1.12.9 (also 1.15.10)
Possible solution
If we remove the most-loaded pod(s), the other pod(s) start to get all the requests.
Additional context
We are also using the topology-mode: auto annotation on our Services, but even within the same zone only a few pods receive requests while the others look idle.
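For completeness, the annotation we mean is Kubernetes topology-aware routing on the server Service, applied roughly like this (namespace and Service names are placeholders):

# Enable topology-aware routing on the server Service (placeholder names);
# Kubernetes then adds per-zone hints to the EndpointSlices.
kubectl -n pl-test annotate svc pl-test-victim service.kubernetes.io/topology-mode=Auto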
Would you like to work on fixing this bug?
maybe
In addition, we've tried to reproduce this on a Calico CNI cluster, and Linkerd distributes requests evenly there. We tried setting the following Cilium params, to no effect (one way to apply them is sketched after the list):
bpf-lb-sock-hostns-only: 'true'
cni-exclusive: 'false'
enable-l7-proxy: 'false'
enable-session-affinity: 'false'
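For anyone reproducing this, one way to apply those keys (a sketch; we assume the default cilium-config ConfigMap and cilium DaemonSet in kube-system, and the equivalent Helm values should work as well):

# Patch the agent ConfigMap with the keys listed above, then restart the
# agents so they pick up the new values.
kubectl -n kube-system patch configmap cilium-config --type merge -p \
  '{"data":{"bpf-lb-sock-hostns-only":"true","cni-exclusive":"false","enable-l7-proxy":"false","enable-session-affinity":"false"}}'
kubectl -n kube-system rollout restart daemonset/cilium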
Installing linkerd-cni didn't help either.
One more strange behavior: linkerd viz returns no data for the Cilium cluster:
> kubectl config use-context k8s-calico
> linkerd viz stat svc/pl-test-victim
NAME             MESHED   SUCCESS        RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
pl-test-victim   -        100.00%   696.0rps          56ms          96ms         100ms          6
> kubectl config use-context k8s-cilium
> linkerd viz stat svc/pl-test-victim
No traffic found.
Linkerd is installed from the same values.yaml in both clusters, and pl-test-victim is deployed identically in both clusters.
@1ovsss It sounds to me like Linkerd is not doing the load balancing... that would explain why the Cilium cluster has no metrics associated with the Service -- the proxy is only forwarding to individual endpoints as decided by Cilium. The output of linkerd diagnostics proxy-metrics on such a client pod would confirm it -- there would be no load-balancer metrics for the Service.
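Something along these lines should do it (a sketch; the pod name and namespace are placeholders, and the grep is just a convenient filter, not an exact metric name):

# Dump the client proxy's Prometheus metrics and look for per-Service
# balancer metrics; if the proxy is only forwarding to an endpoint already
# chosen by Cilium, these will be missing.
linkerd diagnostics proxy-metrics -n pl-test po/pl-test-client-xxxxx | grep -i balancer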
Your Cilium settings look right to me, but I'm not a Cilium expert... you'll want to confirm that Cilium is allowing traffic to target the Service's cluster IP.
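One quick way to check is to look at Cilium's own service map on a node (a sketch; assumes the default cilium DaemonSet in kube-system and placeholder Service names, and note the in-pod CLI may be called cilium-dbg on newer Cilium releases):

# Get the Service's ClusterIP (placeholder names)
kubectl -n pl-test get svc pl-test-victim -o jsonpath='{.spec.clusterIP}'
# See how Cilium is handling that ClusterIP on the node
kubectl -n kube-system exec ds/cilium -- cilium service list | grep <cluster-ip>

If Cilium's socket-level load balancer rewrites the ClusterIP to a pod IP before the connection reaches the proxy, the proxy never observes the Service and can't balance requests across its endpoints.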