v2.1.1: TCPMSS not set up with DSR
**What happened?** Connected to a DSR service, and TCP connections hung.
**What did you expect to happen?** Expected it to work as it did with v1.6.1.
**How can we reproduce the behavior you experienced?** Steps to reproduce the behavior:
- Have a DSR service
- Connect to it from an external network
- Use tcpdump inside the pod's network namespace to see the SYN packets' MSS value (see the sketch after this list)
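A minimal sketch of that last step, assuming `<PID>` stands in for any process running inside the target pod and that port 5000 is just an illustrative service port:

```
# Hypothetical check: capture only SYN packets inside the pod's network namespace
# so that the advertised MSS option is visible in the TCP header.
PID=<PID>   # any process inside the target pod (e.g. found via crictl inspect or ps)
nsenter -n -t "$PID" tcpdump -vvvni any 'tcp[tcpflags] & tcp-syn != 0 and port 5000'
```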
**System Information (please complete the following information):**
- Kube-Router Version: v2.1.1
- Kube-Router Parameters:
  - --run-router=true
  - --run-firewall=true
  - --run-service-proxy=true
  - --advertise-pod-cidr=false
  - --advertise-external-ip
  - --kubeconfig=/var/lib/kube-router/kubeconfig
  - --iptables-sync-period=24h
  - --ipvs-sync-period=24h
  - --routes-sync-period=24h
  - --auto-mtu=false
  - --service-external-ip-range=192.168.8.192/27
  - --service-external-ip-range=192.168.9.0/24
  - --runtime-endpoint=unix:///var/run/crio/crio.sock
- Kubernetes Version (`kubectl version`):
  - Client Version: v1.28.2
  - Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  - Server Version: v1.29.4
- Cloud Type: bare metal
- Kubernetes Deployment Type: kubeadm
- Kube-Router Deployment Type: daemonset
- Cluster Size: 10 nodes
**Logs, other output, metrics:** Please see the attached nft dump: nft-list-ruleset.txt
**Additional context:** The nft dump also shows that many rules are repeated.
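As a rough way to quantify that duplication (an illustrative one-liner, not anything kube-router provides), identical rule lines in the live ruleset can be counted:

```
# Count identical rule lines in the current nft ruleset and show the most repeated ones.
nft list ruleset | sed 's/^[[:space:]]*//' | sort | uniq -c | sort -rn | head -20
```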
Looking at this... I was playing with this in my own cluster a few days ago, and I definitely see the TCPMSS clamping rules in the chain (similar to the nft ruleset output you posted, where I also see the TCPMSS configuration).
However, when looking at the tcpdump, I can see MSS settings in the TCP headers, but they aren't the ones that kube-router has configured, so I'm still looking into this a bit more to see if I can explain why they don't match.
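For reference, a rough way to read back the clamping value kube-router programs, assuming the host uses the iptables-nft backend so the same rules are visible from both tools:

```
# List MSS-clamping rules in the mangle table to see the configured set-mss value.
iptables -t mangle -S | grep -iE 'tcpmss|set-mss'
# The equivalent view through nftables, if the ip mangle table exists on this host.
nft list table ip mangle | grep -iE 'tcpmss|maxseg'
```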
Ok, after looking into this some more, it seems that I was using a non-DSR VIP the first time I was testing it. :man_facepalming:
Anyway, after selecting the correct VIP, I'm unable to reproduce this on my cluster. I deployed the following manifests:
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: tunnel
    kube-router.io/service.local: "true"
    purpose: "Creates a VIP for balancing an application"
  labels:
    name: whoami
  name: whoami
  namespace: default
spec:
  externalIPs:
    - 10.243.0.1
  ports:
    - name: flask
      port: 5000
      protocol: TCP
      targetPort: 5000
  selector:
    name: whoami
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: whoami
  namespace: default
spec:
  selector:
    matchLabels:
      name: whoami
  template:
    metadata:
      labels:
        name: whoami
    spec:
      containers:
        - name: whoami
          image: "docker.io/containous/whoami"
          imagePullPolicy: Always
          command: ["/whoami"]
          args: ["--port", "5000"]
Then, from a node outside the Kubernetes cluster (I use FRR for peering and propagating routes), I curl it as follows:
curl http://10.243.0.1:5000
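Before curling, by the way, it can be worth confirming that the external node actually learned a route to the VIP; a hypothetical check, assuming FRR's vtysh is available there:

```
# Confirm the kernel has a route to the DSR VIP advertised via BGP.
ip route get 10.243.0.1
# Optionally, inspect FRR's BGP view of the same prefix.
vtysh -c 'show ip bgp 10.243.0.1'
```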
Taking a TCP dump on the receiving worker's interface I see:
# tcpdump -vvvni ens5 port 5000
tcpdump: listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
01:55:15.971768 IP (tos 0x0, ttl 64, id 15486, offset 0, flags [DF], proto TCP (6), length 60)
10.95.0.15.53656 > 10.243.0.1.5000: Flags [S], cksum 0x754c (correct), seq 1317075847, win 62727, options [mss 8961,sackOK,TS val 4009135325 ecr 0,nop,wscale 7], length 0
01:55:15.971876 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.243.0.1.5000 > 10.95.0.15.53656: Flags [S.], cksum 0x1590 (incorrect -> 0xd938), seq 276206397, ack 1317075848, win 62643, options [mss 8941,sackOK,TS val 1808698600 ecr 4009135325,nop,wscale 7], length 0
01:55:15.972184 IP (tos 0x0, ttl 64, id 15487, offset 0, flags [DF], proto TCP (6), length 52)
You'll notice above that the SYN packet has the mss value set to what the interface allows via its MTU (8961), whereas the SYN-ACK packet has the clamped mss value of 8941, which is the value that kube-router has defined in the mangle table.
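For what it's worth, and assuming the interface MTU here is 9001, both numbers line up with the expected header overheads (20 bytes each for the IPv4 and TCP headers, plus another 20 bytes for the outer IPv4 header added by IPIP encapsulation):

```
# Plain TCP over a 9001-byte MTU: 9001 - 20 (IPv4) - 20 (TCP) = 8961 (MSS in the SYN)
echo $((9001 - 20 - 20))
# Clamped for IPIP: one extra 20-byte outer IPv4 header = 8941 (MSS in the SYN-ACK)
echo $((9001 - 20 - 20 - 20))
```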
Additionally, looking at the veth interface towards my pod, I can see the following:
# tcpdump -vvvni any host 10.242.1.19
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
01:58:55.428818 kube-bridge Out IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)
10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 52611, offset 0, flags [DF], proto TCP (6), length 60)
10.95.0.15.33544 > 10.243.0.1.5000: Flags [S], cksum 0xac83 (correct), seq 1574939217, win 62727, options [mss 8941,sackOK,TS val 4009354782 ecr 0,nop,wscale 7], length 0
01:58:55.428826 veth2c9fb91a Out IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)
10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 52611, offset 0, flags [DF], proto TCP (6), length 60)
10.95.0.15.33544 > 10.243.0.1.5000: Flags [S], cksum 0xac83 (correct), seq 1574939217, win 62727, options [mss 8941,sackOK,TS val 4009354782 ecr 0,nop,wscale 7], length 0
This shows that all of the IPIP-wrapped packets have the correct MSS set, while the outer layer of the packet is left intact.
Finally, using nsenter to enter the pod's network namespace and running tcpdump there shows the same:
# nsenter -n -t 228927 tcpdump -vvvni any host 10.242.1.19
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
02:08:41.036133 eth0 In IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)
10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 59701, offset 0, flags [DF], proto TCP (6), length 60)
10.95.0.15.48488 > 10.243.0.1.5000: Flags [S], cksum 0x835f (correct), seq 1890883248, win 62727, options [mss 8941,sackOK,TS val 4009940389 ecr 0,nop,wscale 7], length 0
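One last hypothetical check in the same namespace: tunnel-mode DSR relies on the service VIP being assigned to an interface inside the pod, so that the decapsulated packets addressed to 10.243.0.1 are accepted locally:

```
# Verify the DSR VIP (10.243.0.1) is present on an interface inside the pod's netns.
nsenter -n -t 228927 ip -brief addr show | grep 10.243.0.1
```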