
Linkerd-proxy container doesn't update discovery information about deleted Pod IPs for a long time in pods with low outbound RPS

Open sakharovmaksim opened this issue 1 year ago • 2 comments

What is the issue?

Hello, Linkerd's community!

I'm researching a problem where linkerd-proxy sends outbound requests to an outdated Pod IP (the IP of an already dead Pod). This is especially noticeable for Pods with a very low outbound request frequency, for example ~1 request per hour (RPH).

I'm observing that linkerd-proxy requests and updates its internal/cached service discovery information from the Linkerd destination service during an outbound request. Because of this, services with low outbound RPS can keep cached IPs of dead Pods.

See my metrics screenshots for examples of this problem.

Example 1. Goglobal-importer is a service with a very low outbound request frequency (~1 RPH) to the service Goglobal-provider.
– The target Goglobal-provider Pod restarted at 12:16 (12:16 p.m.) and its IP changed.
– At 15:22 (3:22 p.m.) Goglobal-importer made outbound requests to Goglobal-provider and hit the dead IP (10.222.49.44) that had been replaced at 12:16. This caused a number of 'connect timed out' errors.

Request to a dead Pod IP: Screenshot 2024-01-26 at 15 35 02

The target Pod changing its IP after being restarted: Screenshot 2024-01-26 at 15 35 09

Fail-fast network connection error in application logs: Screenshot 2024-01-26 at 15 34 38

Example 2. Krakend is a service with high outbound RPS, while Dictionary, another service, has very low outbound RPS. Compare their request rates to the Linkerd destination service: the Dictionary service shows 0 RPS to Linkerd Destination. Screenshot 2024-01-28 at 21 39 57

How can it be reproduced?

1. Run Service A -> Service B.
2. Make a request from Service A to Service B.
3. Restart Service B so that the Pod's IP address changes.
4. Wait for ~1 hour, make an outbound request from Service A to Service B, and check the IP address of the target Pod the request went to.

Logs, error output, etc

Logs from linkerd-proxy container:

{"app":{"fields":{"error":"endpoint 10.222.34.22:8090: connect timed out after 3s","message":"Failed to connect"},"level":"WARN","spans":[{"name":"outbound"},{"addr":"10.222.49.44:8090","name":"proxy"},{"addr":"linkerd-policy.linkerd.svc.cluster.local:8090","name":"controller"},{"addr":"10.222.34.22:8090","name":"endpoint"}],"target":"linkerd_reconnect","threadId":"ThreadId(1)","timestamp":"[  8204.161469s]"},"kubernetes":{"cluster":"booking-prod-one","container_id":"containerd://3bf07c8f629df930d9fc0dab6318cec7acf32717ce3469175b07f6fdd8140c02","container_image":"cr.l5d.io/linkerd/proxy:stable-2.14.9","container_image_id":"cr.l5d.io/linkerd/proxy@sha256:df43272004ad029cd48236b60d066722a4a923c3c8e4d6f49225873dc2cdf6cb","container_name":"linkerd-proxy","node_labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/instance-type":"vsphere-vm.cpu-16.mem-32gb.os-ubuntu","beta.kubernetes.io/os":"linux","failure-domain.beta.kubernetes.io/region":"vcpaas01","failure-domain.beta.kubernetes.io/zone":"0000-CLW-S2-PaaS-01","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"prod-0-84d4f5c9cx5chcn-bbph5","kubernetes.io/os":"linux","node.kubernetes.io/instance-type":"vsphere-vm.cpu-16.mem-32gb.os-ubuntu","ocean.mts.ru/node-group":"prod-0","topology.kubernetes.io/region":"vcpaas01","topology.kubernetes.io/zone":"0000-CLW-S2-PaaS-01"},"pod_annotations":{"cni.projectcalico.org/containerID":"c102d5c57e4d66c908081a8c3218127472c50a24430ae29efb6fc35cfcaa41e0","cni.projectcalico.org/podIP":"10.222.19.127/32","cni.projectcalico.org/podIPs":"10.222.19.127/32","config.linkerd.io/access-log":"json","config.linkerd.io/opaque-ports":"2045,2021","container.seccomp.security.alpha.kubernetes.io/linkerd-init":"runtime/default","container.seccomp.security.alpha.kubernetes.io/linkerd-proxy":"runtime/default","kubectl.kubernetes.io/restartedAt":"2023-12-11T14:08:37Z","linkerd.io/created-by":"linkerd/proxy-injector 
stable-2.14.9","linkerd.io/inject":"enabled","linkerd.io/proxy-version":"stable-2.14.9","linkerd.io/trust-root-sha256":"42fb25c28958c713ca9a56ea8ed510750a5f1b37d877388dcc131d9bc51ddcd7","vault.security.banzaicloud.io/vault-role":"d51ef6fa-8bcc-4731-bcf2-be77dc939e82","viz.linkerd.io/tap-enabled":"true"},"pod_ip":"10.222.19.127","pod_ips":["10.222.19.127"],"pod_labels":{"app":"goglobal-importer","linkerd.io/control-plane-ns":"linkerd","linkerd.io/proxy-deployment":"goglobal-importer","linkerd.io/workload-ns":"booking-prod-inside","pod-template-hash":"5578c5f7f6"},"pod_name":"goglobal-importer-5578c5f7f6-72pv5","pod_namespace":"booking-prod-inside","pod_node_name":"prod-0-84d4f5c9cx5chcn-bbph5","pod_owner":"ReplicaSet/goglobal-importer-5578c5f7f6"}}

output of linkerd check -o short

linkerd check -o short
linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

Status check results are √

Environment

K8s version: v1.22.15
Linkerd version: stable-2.14.9
Linkerd proxy discovery cache config:

outboundDiscoveryCacheUnusedTimeout: "60s"
inboundDiscoveryCacheUnusedTimeout: "120s"
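
For context, these two values control how long the proxy keeps unused discovery state cached. A sketch of where they would live in a Helm values file (the key nesting under `proxy` is assumed from the 2.14 chart; verify against your chart version):

```yaml
# values.yaml fragment (assumed layout for the linkerd-control-plane 2.14 chart)
proxy:
  # Evict outbound discovery results that haven't been used for this long
  outboundDiscoveryCacheUnusedTimeout: "60s"
  # Same idea for inbound discovery state
  inboundDiscoveryCacheUnusedTimeout: "120s"
```

With a 60s unused timeout and ~1 RPH of traffic, the cached outbound entry should normally be evicted between requests, which is what makes the stale IP in this report surprising.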

Possible solution

On the application (service) side, a possible workaround would be to add, for services with low outbound RPS, a mechanism that periodically pings the target services they use over the network.
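
A minimal sketch of that workaround in Python, assuming the service can spare a background thread; the URL, interval, and health endpoint are illustrative, not from the issue. The point is that any outbound request counts as "use" of the destination, so periodic traffic keeps the proxy's discovery cache fresh:

```python
import threading
import urllib.request

def start_keepalive(url, interval_s=30.0):
    """Periodically issue a lightweight request to `url` so the sidecar
    proxy keeps refreshing its discovery information for that destination.
    Returns an Event; call .set() on it to stop the pinger.
    (Workaround sketch only; url and interval are illustrative.)"""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            try:
                # Even a failed connect forces the proxy to perform
                # a fresh lookup for this destination.
                urllib.request.urlopen(url, timeout=3).close()
            except OSError:
                pass
            stop.wait(interval_s)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Usage would be a single call at service startup, e.g. `start_keepalive("http://goglobal-provider:8090/health")`, with the interval chosen well below the proxy's discovery cache timeout.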

Additional context

No response

Would you like to work on fixing this bug?

maybe

sakharovmaksim avatar Jan 28 '24 18:01 sakharovmaksim

Thanks for the detailed report. Can you give the latest edge release a shot? Since 2.14.9 we've pushed a few improvements that address discovery staleness.

alpeb avatar Feb 01 '24 18:02 alpeb

Continuing the topic: we followed @alpeb's advice and installed edge-24.2.1. The Krakend service sends traffic to the Auth service. After restarting the Auth service, the linkerd-proxy container log of the Krakend service shows: Screenshot_5. Fail-fast network connection error in application logs: Screenshot_8

lexa322 avatar Feb 07 '24 08:02 lexa322

Our experiments showed that the problem described above comes from running Linkerd on Kubernetes v1.22.15. We tested Linkerd with Kubernetes v1.27 and saw that Linkerd's Destination service correctly listens to Kubernetes events and syncs its data.

sakharovmaksim avatar Apr 01 '24 11:04 sakharovmaksim

@sakharovmaksim thanks a lot for the update and for opening the issue. Since it's been resolved, I'll go ahead and close this. Please let us know if anything changes or if you'd like to re-open this. :)

mateiidavid avatar Apr 05 '24 15:04 mateiidavid