linkerd2
Linkerd-proxy container doesn't update discovery information about deleted Pod IPs in low-RPS outbound pods for a long time
What is the issue?
Hello, Linkerd community!
I'm investigating a problem where linkerd-proxy sends outbound requests to an outdated Pod IP (the IP of an already dead Pod). This is especially noticeable for Pods with a very low outbound request frequency, for example ~1 request per hour (RPH).
From what I observe, linkerd-proxy requests and updates its internal/cached service discovery information from the Linkerd Destination service during an outbound request. Because of this, services with low outbound RPS can end up holding the cached IPs of dead Pods.
The metrics screenshots below show examples of this problem.
Example 1. Goglobal-importer is a service with a very low outbound request frequency (~1 RPH) to the Goglobal-provider service.
– The target Goglobal-provider Pod restarted at 12:16 (12:16 p.m.) and changed its IP.
– At 15:22 (3:22 p.m.) Goglobal-importer made outbound requests to Goglobal-provider and hit the dead IP (10.222.49.44), which had been replaced at 12:16. This caused a number of 'connect timed out' errors.
Request to the dead Pod IP:
The target Pod changed its IP after being restarted:
Fail-fast network connection error in application logs
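To narrow down whether the stale IP lives in the proxy's cache or in the control plane, one check (a sketch; the authority, namespace, and label selector below are taken from this example and may need adjusting) is to ask the Destination controller what it currently resolves and compare it with the live Pod IPs:

# Ask the Destination controller which endpoints it resolves for the target authority
linkerd diagnostics endpoints goglobal-provider.booking-prod-inside.svc.cluster.local:8090

# Compare with the Pod IPs Kubernetes currently reports (label selector is an assumption)
kubectl -n booking-prod-inside get pods -l app=goglobal-provider -o wide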
Example 2.
Krakend is a service with high outbound RPS, while Dictionary is a service with very low outbound RPS to other services. Look at the request RPS to the Linkerd Destination service.
Dictionary service has 0 RPS to Linkerd Destination
How can it be reproduced?
1. Run Service A -> Service B (Service A calls Service B).
2. Make a request from A to B.
3. Restart Service B so that the Pod's IP address changes.
4. Wait ~1 hour, then make an outbound request from Service A to Service B and check the IP address of the target Pod that the request went to (see the sketch below).
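A minimal sketch of these steps with kubectl and the linkerd CLI (the namespace and deployment names are placeholders):

# Steps 2-3: send a request from A to B, then restart B so its Pod gets a new IP
kubectl -n demo rollout restart deploy/service-b
kubectl -n demo rollout status deploy/service-b

# Step 4: after ~1 hour with no traffic from A to B, send a request again and
# watch which destination IP the proxy of Service A actually dials
linkerd viz tap -n demo deploy/service-a --to deploy/service-b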
Logs, error output, etc
Logs from linkerd-proxy container:
{"app":{"fields":{"error":"endpoint 10.222.34.22:8090: connect timed out after 3s","message":"Failed to connect"},"level":"WARN","spans":[{"name":"outbound"},{"addr":"10.222.49.44:8090","name":"proxy"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"},{"addr":"10.222.34.22:8090","name":"endpoint"}],"target":"linkerd_reconnect","threadId":"ThreadId(1)","timestamp":"[ 8204.161469s]"},"kubernetes":{"cluster":"booking-prod-one","container_id":"containerd://3bf07c8f629df930d9fc0dab6318cec7acf32717ce3469175b07f6fdd8140c02","container_image":"cr.l5d.io/linkerd/proxy:stable-2.14.9","container_image_id":"cr.l5d.io/linkerd/proxy@sha256:df43272004ad029cd48236b60d066722a4a923c3c8e4d6f49225873dc2cdf6cb","container_name":"linkerd-proxy","node_labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/instance-type":"vsphere-vm.cpu-16.mem-32gb.os-ubuntu","beta.kubernetes.io/os":"linux","failure-domain.beta.kubernetes.io/region":"vcpaas01","failure-domain.beta.kubernetes.io/zone":"0000-CLW-S2-PaaS-01","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"prod-0-84d4f5c9cx5chcn-bbph5","kubernetes.io/os":"linux","node.kubernetes.io/instance-type":"vsphere-vm.cpu-16.mem-32gb.os-ubuntu","ocean.mts.ru/node-group":"prod-0","topology.kubernetes.io/region":"vcpaas01","topology.kubernetes.io/zone":"0000-CLW-S2-PaaS-01"},"pod_annotations":{"cni.projectcalico.org/containerID":"c102d5c57e4d66c908081a8c3218127472c50a24430ae29efb6fc35cfcaa41e0","cni.projectcalico.org/podIP":"10.222.19.127/32","cni.projectcalico.org/podIPs":"10.222.19.127/32","config.linkerd.io/access-log":"json","config.linkerd.io/opaque-ports":"2045,2021","container.seccomp.security.alpha.kubernetes.io/linkerd-init":"runtime/default","container.seccomp.security.alpha.kubernetes.io/linkerd-proxy":"runtime/default","kubectl.kubernetes.io/restartedAt":"2023-12-11T14:08:37Z","linkerd.io/created-by":"linkerd/proxy-injector stable-2.14.9","linkerd.io/inject":"enabled","linkerd.io/proxy-version":"stable-2.14.9","linkerd.io/trust-root-sha256":"42fb25c28958c713ca9a56ea8ed510750a5f1b37d877388dcc131d9bc51ddcd7","vault.security.banzaicloud.io/vault-role":"d51ef6fa-8bcc-4731-bcf2-be77dc939e82","viz.linkerd.io/tap-enabled":"true"},"pod_ip":"10.222.19.127","pod_ips":["10.222.19.127"],"pod_labels":{"app":"goglobal-importer","linkerd.io/control-plane-ns":"linkerd","linkerd.io/proxy-deployment":"goglobal-importer","linkerd.io/workload-ns":"booking-prod-inside","pod-template-hash":"5578c5f7f6"},"pod_name":"goglobal-importer-5578c5f7f6-72pv5","pod_namespace":"booking-prod-inside","pod_node_name":"prod-0-84d4f5c9cx5chcn-bbph5","pod_owner":"ReplicaSet/goglobal-importer-5578c5f7f6"}}
output of linkerd check -o short
linkerd check -o short
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints
Status check results are √
Environment
K8s version: v1.22.15
Linkerd version: stable-2.14.9
Linkerd Proxy Discovery Cache Config:
outboundDiscoveryCacheUnusedTimeout: "60s"
inboundDiscoveryCacheUnusedTimeout: "120s"
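For reference, a sketch of how these timeouts can be set cluster-wide at upgrade time, assuming the proxy.* Helm value names used by the 2.14 linkerd-control-plane chart:

# Set the proxy discovery cache idle timeouts (value names assumed from the 2.14 chart)
linkerd upgrade \
  --set proxy.outboundDiscoveryCacheUnusedTimeout=60s \
  --set proxy.inboundDiscoveryCacheUnusedTimeout=120s \
  | kubectl apply -f -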
Possible solution
At the application (service) level, a possible workaround for services with low outbound RPS would be to add a mechanism that periodically pings the target services they use over the network, so that the proxy's discovery cache for those targets never sits unused.
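A minimal sketch of such a keepalive, assuming the target exposes a health endpoint (/healthz is a hypothetical path) and the loop runs in a meshed container so the request goes through linkerd-proxy:

# Periodically touch the target service so its discovery entry in the proxy never goes idle
# (the interval and URL are assumptions)
while true; do
  curl -sf -o /dev/null http://goglobal-provider.booking-prod-inside.svc.cluster.local:8090/healthz \
    || echo "keepalive ping failed at $(date)"
  sleep 30
done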
Additional context
No response
Would you like to work on fixing this bug?
maybe
Thanks for the detailed report. Can you give the latest edge release a shot? Since 2.14.9 we've pushed a few improvements that address discovery staleness.
Continuing the topic: I followed @alpeb's advice. The installed version is edge-24.2.1. The Krakend service sends traffic to the Auth service. After restarting the Auth service, the linkerd-proxy container log of the Krakend service shows:
Fail-fast network connection error in application logs:
Our experiments showed that the problem described above stems from using Linkerd with Kubernetes v1.22.15. We tested Linkerd with Kubernetes v1.27 and saw that Linkerd (the Destination service) correctly listens to Kubernetes events and syncs its data.
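One way to verify that the Destination controller keeps up with endpoint changes is to restart the target and compare what Kubernetes publishes with what the controller resolves; a sketch with placeholder names and port:

# Restart the target and watch Kubernetes publish the new endpoint IPs
kubectl -n demo rollout restart deploy/auth
kubectl -n demo get endpointslices -l kubernetes.io/service-name=auth -w

# In another terminal, confirm the Destination controller returns the new IPs as well
linkerd diagnostics endpoints auth.demo.svc.cluster.local:8080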
@sakharovmaksim thanks a lot for the update and for opening the issue. Since it's been resolved, I'll go ahead and close this. Please let us know if anything changes or if you'd like to re-open this. :)