Linkerd preventing a successful TCP connection between two pods
What is the issue?
I have two pods in my vanilla Kubernetes cluster:

```
apps1 <--------------> apps2
```

In apps1, I am running a TCP client that connects to the TCP server in the other pod at port 40000. In apps2, I am running the TCP server at port 40000.
I use the same helm chart to deploy two different instances of the app:

```
helm install apps1 . -n sample
helm install apps2 . -n sample
```
Without linkerd (`podAnnotations: {}`), the TCP client is able to make a connection with the TCP server and transfers data.
Note that the clusterIPs and pod IPs are as follows:

- apps1: clusterIP = 10.99.90.94, podIP = 10.244.2.200
- apps2: clusterIP = 10.97.202.67, podIP = 10.244.2.201
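(For reference, these addresses can be listed with standard kubectl commands, assuming the `sample` namespace from the helm installs above:)

```sh
kubectl get svc -n sample -o wide    # ClusterIPs of apps1/apps2
kubectl get pods -n sample -o wide   # Pod IPs
```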
Server logs:

```
TCP_Server: Server started at 10.244.2.201:40000
TCPServer accepted connection request from client (10.244.2.200:57706)
```
Client logs:

```
Connecting to Server IP Address and Port -> 10.97.202.67:40000
TCP client is now connected to server
Send only 1 pkts and exit...
Done. Exiting now...
```
Tcpdump capture inside the apps2 pod:

```
sudo tcpdump -i any port 40000 -n -vvxx
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:49:43.155451 eth0  In  IP (tos 0x0, ttl 64, id 55801, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.2.200.57706 > 10.244.2.201.40000: Flags [S], cksum 0x1ba7 (incorrect -> 0x772e), seq 2892585870, win 64860, options [mss 1410,sackOK,TS val 3259985979 ecr 0,nop,wscale 7], length 0
        0x0000:  0800 0000 0000 0002 0001 0006 aa1b af08
        0x0010:  da8e 0000 4500 003c d9f9 4000 4006 454a
        0x0020:  0af4 02c8 0af4 02c9 e16a 9c40 ac69 5b8e
        0x0030:  0000 0000 a002 fd5c 1ba7 0000 0204 0582
        0x0040:  0402 080a c24f 703b 0000 0000 0103 0307
13:49:43.155471 eth0  Out IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.2.201.40000 > 10.244.2.200.57706: Flags [S.], cksum 0x1ba7 (incorrect -> 0xb37a), seq 833352799, ack 2892585871, win 65535, options [mss 1410,sackOK,TS val 405242579 ecr 3259985979,nop,wscale 1], length 0
        0x0000:  0800 0000 0000 0002 0001 0406 baa5 2b7d
        0x0010:  7a6f 0000 4500 003c 0000 4000 4006 1f44
        0x0020:  0af4 02c9 0af4 02c8 9c40 e16a 31ab f45f
        0x0030:  ac69 5b8f a012 ffff 1ba7 0000 0204 0582
        0x0040:  0402 080a 1827 82d3 c24f 703b 0103 0301
13:49:43.155493 eth0  In  IP (tos 0x0, ttl 64, id 55802, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.2.200.57706 > 10.244.2.201.40000: Flags [.], cksum 0x1b9f (incorrect -> 0xe013), seq 1, ack 1, win 507, options [nop,nop,TS val 3259985979 ecr 405242579], length 0
        0x0000:  0800 0000 0000 0002 0001 0006 aa1b af08
        0x0010:  da8e 0000 4500 0034 d9fa 4000 4006 4551
        0x0020:  0af4 02c8 0af4 02c9 e16a 9c40 ac69 5b8f
        0x0030:  31ab f460 8010 01fb 1b9f 0000 0101 080a
        0x0040:  c24f 703b 1827 82d3
13:49:43.155550 eth0  In  IP (tos 0x0, ttl 64, id 55803, offset 0, flags [DF], proto TCP (6), length 152)
    10.244.2.200.57706 > 10.244.2.201.40000: Flags [P.], cksum 0x1c03 (incorrect -> 0xdc72), seq 1:101, ack 1, win 507, options [nop,nop,TS val 3259985979 ecr 405242579], length 100
        0x0000:  0800 0000 0000 0002 0001 0006 aa1b af08
        0x0010:  da8e 0000 4500 0098 d9fb 4000 4006 44ec
        0x0020:  0af4 02c8 0af4 02c9 e16a 9c40 ac69 5b8f
        0x0030:  31ab f460 8018 01fb 1c03 0000 0101 080a
        0x0040:  c24f 703b 1827 82d3 4142 4344 4546 4748
        0x0050:  494a 4b4c 4d4e 4f50 5152 5354 5556 5758
        0x0060:  595a 4142 4344 4546 4748 494a 4b4c 4d4e
        0x0070:  4f50 5152 5354 5556 5758 595a 4142 4344
        0x0080:  4546 4748 494a 4b4c 4d4e 4f50 5152 5354
        0x0090:  5556 5758 595a 4142 4344 4546 4748 494a
        0x00a0:  4b4c 4d4e 4f50 5152 5354 5556
13:49:43.155557 eth0  Out IP (tos 0x0, ttl 64, id 59220, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.2.201.40000 > 10.244.2.200.57706: Flags [.], cksum 0x1b9f (incorrect -> 0x61aa), seq 1, ack 101, win 32768, options [nop,nop,TS val 405242579 ecr 3259985979], length 0
        0x0000:  0800 0000 0000 0002 0001 0406 baa5 2b7d
        0x0010:  7a6f 0000 4500 0034 e754 4000 4006 37f7
        0x0020:  0af4 02c9 0af4 02c8 9c40 e16a 31ab f460
        0x0030:  ac69 5bf3 8010 8000 1b9f 0000 0101 080a
        0x0040:  1827 82d3 c24f 703b
13:49:43.155670 eth0  In  IP (tos 0x0, ttl 64, id 55804, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.2.200.57706 > 10.244.2.201.40000: Flags [F.], cksum 0x1b9f (incorrect -> 0xdfae), seq 101, ack 1, win 507, options [nop,nop,TS val 3259985979 ecr 405242579], length 0
        0x0000:  0800 0000 0000 0002 0001 0006 aa1b af08
        0x0010:  da8e 0000 4500 0034 d9fc 4000 4006 454f
        0x0020:  0af4 02c8 0af4 02c9 e16a 9c40 ac69 5bf3
        0x0030:  31ab f460 8011 01fb 1b9f 0000 0101 080a
        0x0040:  c24f 703b 1827 82d3
13:49:43.197898 eth0  Out IP (tos 0x0, ttl 64, id 59221, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.2.201.40000 > 10.244.2.200.57706: Flags [.], cksum 0x1b9f (incorrect -> 0x617e), seq 1, ack 102, win 32768, options [nop,nop,TS val 405242622 ecr 3259985979], length 0
        0x0000:  0800 0000 0000 0002 0001 0406 baa5 2b7d
        0x0010:  7a6f 0000 4500 0034 e755 4000 4006 37f6
        0x0020:  0af4 02c9 0af4 02c8 9c40 e16a 31ab f460
        0x0030:  ac69 5bf4 8010 8000 1b9f 0000 0101 080a
        0x0040:  1827 82fe c24f 703b
```
Error scenario: with linkerd, there is an issue with the TCP connection and data transfer.

```yaml
podAnnotations:
  linkerd.io/inject: enabled
```
ClusterIPs and pod IPs are as follows in this deployment:

- apps1: clusterIP = 10.111.233.142, podIP = 10.244.2.202
- apps2: clusterIP = 10.103.146.121, podIP = 10.244.2.203
Start the TCP server in the apps2 pod:

```
TCP_Server: Server started at 10.244.2.203:40000
```
Start the TCP client in the apps1 pod:

```
Connecting to Server IP Address and Port -> 10.103.146.121:40000
TCP client is now connected to server
Send only 1 pkts and exit...
Done. Exiting now...
```
But no data is received in the apps2 pod...
Tcpdump inside the apps1 pod is as follows:

```
sudo tcpdump -i any port 40000 -n -vvxx
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
14:08:42.233846 lo    In  IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.103.146.121.40000 > 10.244.2.202.59832: Flags [S.], cksum 0xaacc (incorrect -> 0xf048), seq 821429181, ack 2790913719, win 65483, options [mss 65495,sackOK,TS val 1512729640 ecr 1071273262,nop,wscale 7], length 0
        0x0000:  0800 0000 0000 0001 0304 0006 0000 0000
        0x0010:  0000 0000 4500 003c 0000 4000 4006 901e
        0x0020:  0a67 9279 0af4 02ca 9c40 e9b8 30f6 03bd
        0x0030:  a659 f6b7 a012 ffcb aacc 0000 0204 ffd7
        0x0040:  0402 080a 5a2a 6c28 3fda 552e 0103 0307
14:08:42.233953 lo    In  IP (tos 0x0, ttl 64, id 44792, offset 0, flags [DF], proto TCP (6), length 52)
    10.103.146.121.40000 > 10.244.2.202.59832: Flags [.], cksum 0xaac4 (incorrect -> 0x16a0), seq 1, ack 101, win 511, options [nop,nop,TS val 1512729641 ecr 1071273263], length 0
        0x0000:  0800 0000 0000 0001 0304 0006 0000 0000
        0x0010:  0000 0000 4500 0034 aef8 4000 4006 e12d
        0x0020:  0a67 9279 0af4 02ca 9c40 e9b8 30f6 03be
        0x0030:  a659 f71b 8010 01ff aac4 0000 0101 080a
        0x0040:  5a2a 6c29 3fda 552f
14:08:42.235970 lo    In  IP (tos 0x0, ttl 64, id 44793, offset 0, flags [DF], proto TCP (6), length 52)
    10.103.146.121.40000 > 10.244.2.202.59832: Flags [F.], cksum 0xaac4 (incorrect -> 0x169b), seq 1, ack 102, win 512, options [nop,nop,TS val 1512729643 ecr 1071273263], length 0
        0x0000:  0800 0000 0000 0001 0304 0006 0000 0000
        0x0010:  0000 0000 4500 0034 aef9 4000 4006 e12c
        0x0020:  0a67 9279 0af4 02ca 9c40 e9b8 30f6 03be
        0x0030:  a659 f71c 8011 0200 aac4 0000 0101 080a
        0x0040:  5a2a 6c2b 3fda 552f
```
It looks like the linkerd-proxy in the apps2 pod is not forwarding my connection to the TCP server process in that pod.
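(A minimal debugging sketch, assuming the server workload keeps its `apps2` release name in the `sample` namespace; substitute the actual deployment or pod name:)

```sh
# Inspect the server-side proxy to see whether the inbound connection
# ever arrives; any inbound errors would show up here.
kubectl logs -n sample deploy/apps2 -c linkerd-proxy --tail=100
```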
How can it be reproduced?
Deploy two pods exposing TCP port 40000. Run the TCP client in one pod and the TCP server in the other.
Without linkerd, the connection is OK and data is received at the server side. With linkerd, the linkerd-proxy in the server pod is not forwarding packets to the TCP server process. (A quick way to confirm the sidecars were injected in both pods is sketched below.)
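(A sketch, assuming the `sample` namespace:)

```sh
# linkerd-proxy should appear among the containers, and linkerd-init
# among the init containers, of every injected pod.
kubectl get pods -n sample -o jsonpath='{.items[*].spec.containers[*].name}'
kubectl get pods -n sample -o jsonpath='{.items[*].spec.initContainers[*].name}'
```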
Logs, error output, etc
Packets captured in the client pod with `sudo tcpdump -i any port 40000 -n -vvxx` are identical to the capture shown above.
All captured packets appear on the `lo` interface, and their source address is the TCP server's Service IP (10.103.146.121), which suggests the connection is being terminated locally by the client's linkerd-proxy.
Output of `linkerd check -o short`
Linkerd is running successfully in this cluster. I can see two extra containers in each pod: `linkerd-init` and `linkerd-proxy`.
linkerd-proxy logs in the client pod are as follows:

```
[     0.005256s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.005280s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.005283s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.005287s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.005296s]  INFO ThreadId(01) linkerd2_proxy: Local identity is default.edge-01.serviceaccount.identity.edge-01.cluster.local
[     0.005300s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.edge-01.svc.cluster.local:8080 (linkerd-identity.edge-01.serviceaccount.identity.edge-01.cl
[     0.005307s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.edge-01.svc.cluster.local:8086 (linkerd-destination.edge-01.serviceaccount.identity.edge-01.
[     0.022346s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.edge-01.serviceaccount.identity.edge-01.cluster.local
[   225.964488s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connection closed before message completed client.addr=10.244.2.202:48396
[   244.136177s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=connection closed before message completed client.addr=10.244.2.202:59832
```
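(The `connection closed before message completed` lines come from the client's outbound proxy. If more detail were needed, the proxy log level can be raised via Linkerd's `config.linkerd.io/proxy-log-level` annotation; a sketch, assuming the same podAnnotations block used elsewhere in this chart. Note it only takes effect when the pod is (re)injected:)

```yaml
podAnnotations:
  linkerd.io/inject: enabled
  # Raise linkerd-proxy verbosity for debugging; applied at injection time.
  config.linkerd.io/proxy-log-level: "warn,linkerd=debug"
```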
Environment
Vanilla Kubernetes v1.27.16 multi-node cluster. Each node is an Ubuntu 20.04 KVM guest.
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
Thanks for filing an issue @bhati-github. Are you able to share the helm charts you are using, or provide a reproduction of the problem that we can run? This would be helpful for further diagnosing what the issue might be.
Hi @cratelyn, I will prepare a helm chart soon and share it with you along with the docker image of the pod. I am a bit busy with deliverables nowadays. Thanks for your patience.
I am trying to add the helm chart and docker image tarfile, but I am getting errors in this window. Is there any other way to share files?
I want to add these two files for your test. One of them is a helm chart. The other is a docker image tarfile (approx. 350 MB).
@bhati-github Can you push the Docker image somewhere? That's probably simplest.
Hi, let me simplify.
This is the Dockerfile you can use to build the docker image in your lab:
```dockerfile
FROM ubuntu:24.04@sha256:80dd3c3b9c6cecb9f1667e9290b3bc61b78c2678c02cbdae5f0fea92cc6734ab

RUN apt-get update
RUN apt-get install --no-install-recommends --yes sudo tree iproute2 net-tools iputils-ping tcpdump iperf3

RUN groupadd -r -g 1208 appsgrp && useradd -u 1208 -g appsgrp -s /bin/bash -m apps
RUN echo "apps ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/apps

WORKDIR /home/apps/
COPY tcp-client-server-app /home/apps/tcp-client-server-app
RUN chown apps:appsgrp /home/apps/* && chmod +x /home/apps/*

USER 1208:1208

ENTRYPOINT ["tail", "-f", "/dev/null"]
```
You need this file, "tcp-client-server-app". I am attaching it as a zip file now.
Now you can build the docker image yourself and upload it into your lab repository.
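(A usage sketch, with a placeholder registry URL:)

```sh
# Build with tcp-client-server-app present in the build context,
# then push to the lab registry.
docker build -t your-lab-repository-url/apps:latest .
docker push your-lab-repository-url/apps:latest
```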
The next thing you want is the helm chart to deploy this app. Here it is...
Please make sure that you specify your repository URL inside the values.yaml file. A trailing forward slash is required after the repo URL:

```yaml
image:
  repo: "your-lab-repository-url/"
  name: "apps"
  tag: "latest"
```
If linkerd is disabled:
Deploy two instances of this app:

```
helm install client . -n sample
helm install server . -n sample
```
Go inside the server pod and start the server with pod IP 10.244.1.155 and TCP server port 4000:

```
prompt> ./tcp-client-server-app -s 10.244.1.155 4000
Mode: Server...
TCP_Server: Server started at 10.244.1.155:4000
TCPServer accepted connection request from client (10.244.2.6:46508)
```
When you start the client from the client pod, the logs are as follows:
```
prompt> ./tcp-client-server-app -c 10.97.21.107 4000 100 10 100
```

(Please note that 10.97.21.107 is the cluster IP for the "server" Service.)

```
Mode: Client...
Payload Size = 100 bytes
Packet Count = 10
Inter-Pkt Delay = 100.000000 milliseconds
Connecting to Server IP Address and Port -> 10.97.21.107:4000
TCP client is now connected to server
Send only 10 pkts and exit...
Pkts in Interval [1-2]: 10 , Total Pkts = 10
Done. Exiting now...
```
(Server-side and client-side screenshots attached.)
Now, enable linkerd in the helm chart's values.yaml file and re-deploy the client and server instances:
```yaml
podAnnotations:
  linkerd.io/inject: enabled
```
This time, you will see that the connection is not established with the server. There is no server log saying "TCPServer accepted connection request from client".
Actually, @bhati-github, let's back up a minute here. What version of Linkerd are you using? If you just install Faces, does that work?
```sh
kubectl create ns faces
kubectl annotate ns faces linkerd.io/inject=enabled
helm install -n faces faces \
     oci://ghcr.io/buoyantio/faces-chart \
     --version 2.0.0 \
     --set gui.serviceType=LoadBalancer
kubectl rollout status -n faces deploy
```
If you open a browser to the faces-gui Service in the faces namespace, you should see a grid of faces, some of which are grinning faces on a blue background.
I ask because, if I'm reading the templates in your chart correctly, you don't declare 40000 as a valid port anywhere: your Service definition calls out port 4000. Recent Linkerds don't allow connections to undeclared ports...
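(For illustration, a minimal sketch of what declaring the port would look like; the names `server`, `apps`, and the label are assumptions based on this thread, not the actual chart:)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: sample
spec:
  selector:
    app: server          # assumed label
  ports:
    - name: tcp-app
      port: 4000         # Service port the client dials
      targetPort: 4000   # must match the containerPort below
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  namespace: sample
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
        - name: apps
          image: your-lab-repository-url/apps:latest
          ports:
            - containerPort: 4000   # declares the port to Linkerd
```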
Hi @kflynn, sorry for the misunderstanding caused by the initial logs I posted in this issue. At the time of posting those logs, I was trying with port 40000, but for some reason I changed the port value to 4000 recently. The latest helm chart I shared around 2-3 days back in this issue actually uses port 4000, and the issue was reproducible with it.
I am testing this in a pre-production lab environment in which telco-grade software is deployed. Linkerd is deployed in the cluster to protect TCP streams for pod-to-pod traffic on selected ports.
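(Since this is raw, non-HTTP TCP traffic, one relevant knob is Linkerd's `config.linkerd.io/opaque-ports` annotation, which tells the proxy to skip protocol detection on those ports; a sketch, assuming port 4000:)

```yaml
podAnnotations:
  linkerd.io/inject: enabled
  # Treat port 4000 as an opaque TCP stream instead of
  # attempting protocol detection on it.
  config.linkerd.io/opaque-ports: "4000"
```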
I will also check your suggestion to use Faces. I will post the results soon.
> What version of Linkerd are you using? If you just install Faces, does that work?
I was able to deploy Faces per your sequence of commands. When I accessed the GUI, I could see a 4x4 grid of faces, but none of them were stable; the faces kept changing (Sad -> Angry -> Happy...). It was difficult for me to extract a meaningful picture of the services from that fluctuating display.
If it is possible for you, we could get together on a Teams or Zoom call so that I can show you exactly what is happening at my end. If yes, let me know your timezone and preferred time along with your email ID, and I will send an invite.
Oh dear, @bhati-github, my apologies for missing this! 🙁 If you're still running into trouble I'd be happy to jump on Zoom sometime -- I'm US/Eastern time, so GMT-5 at the moment.