
v2.1.1: TCPMSS not setup with DSR

Open • rkojedzinszky opened this issue 9 months ago • 1 comment

What happened? I connected to a DSR service, and TCP connections hung.

What did you expect to happen? I expected it to work as it did with v1.6.1.

How can we reproduce the behavior you experienced? Steps to reproduce the behavior:

  1. Have a DSR service
  2. Connect to it from external network
  3. Use tcpdump inside the pod's network namespace to inspect the SYN packets' MSS value (see the sketch below)
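
For step 3, a minimal sketch of how the capture could be run; the <pod-pid> placeholder and the assumption that the pod's interface is eth0 are mine, not from the report:

# <pod-pid> stands for any process PID inside the target pod (e.g. found via
# the container runtime or ps). With DSR in tunnel mode the SYN arrives
# IPIP-encapsulated, so capture everything on the pod's eth0 and read the MSS
# from the inner TCP options that tcpdump prints.
nsenter -n -t <pod-pid> tcpdump -vvvni eth0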

**System Information (please complete the following information):**

  • Kube-Router Version: v2.1.1

  • Kube-Router Parameters:
      - --run-router=true
      - --run-firewall=true
      - --run-service-proxy=true
      - --advertise-pod-cidr=false
      - --advertise-external-ip
      - --kubeconfig=/var/lib/kube-router/kubeconfig
      - --iptables-sync-period=24h
      - --ipvs-sync-period=24h
      - --routes-sync-period=24h
      - --auto-mtu=false
      - --service-external-ip-range=192.168.8.192/27
      - --service-external-ip-range=192.168.9.0/24
      - --runtime-endpoint=unix:///var/run/crio/crio.sock

  • Kubernetes Version:

$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
  • Cloud Type: bare metal
  • Kubernetes Deployment Type: kubeadm
  • Kube-Router Deployment Type: daemonset
  • Cluster Size: 10 nodes

**Logs, other output, metrics** Please see the attached nft dump: nft-list-ruleset.txt

Additional context: The nft dump also shows that many rules are repeated.

rkojedzinszky • May 07 '24 15:05

Looking at this... I was playing with this in my own cluster a few days ago, and I definitely see the TCPMSS clamping rules in the chain (similar to the nft ruleset output you posted, where I also see the TCPMSS configuration).

However, when looking at the tcpdump, I can see MSS settings in the TCP headers, but they aren't the ones that kube-router has configured, so I'm still looking into this a bit more to see if I can explain why they don't match.
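
For reference, these are the kinds of commands I'd use to spot the clamping rules; the exact rule text depends on whether iptables-legacy or iptables-nft is in use, so the patterns below are only illustrative:

# Look for the TCPMSS target / maxseg keyword in the mangle table; exactly how
# the rule renders depends on the iptables backend in use.
iptables-save -t mangle | grep -i tcpmss
nft list ruleset | grep -i maxseg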

aauren • May 13 '24 01:05

Ok, after looking into this some more, it seems that I was using a non-DSR VIP the first time I was testing it. :man_facepalming:

Anyway, after selecting the correct VIP, I'm unable to reproduce this on my cluster. I deployed the following manifests:

apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: tunnel
    kube-router.io/service.local: "true"
    purpose: "Creates a VIP for balancing an application"
  labels:
    name: whoami
  name: whoami
  namespace: default
spec:
  externalIPs:
  - 10.243.0.1
  ports:
  - name: flask
    port: 5000 
    protocol: TCP
    targetPort: 5000 
  selector:
    name: whoami
  type: ClusterIP

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: whoami
  namespace: default
spec:
  selector:
    matchLabels:
      name: whoami
  template:
    metadata:
      labels:
        name: whoami
    spec:
      containers:
        - name: whoami
          image: "docker.io/containous/whoami"
          imagePullPolicy: Always
          command: ["/whoami"]
          args: ["--port", "5000"]

Then, from a node that is outside the Kubernetes cluster (I use FRR for peering and propagating routes), I curl it as follows:

curl http://10.243.0.1:5000

Taking a tcpdump on the receiving worker's interface, I see:

# tcpdump -vvvni ens5 port 5000                                                                                                                                                                           
tcpdump: listening on ens5, link-type EN10MB (Ethernet), snapshot length 262144 bytes                                                       
01:55:15.971768 IP (tos 0x0, ttl 64, id 15486, offset 0, flags [DF], proto TCP (6), length 60)                                                                                                                                                                                           
    10.95.0.15.53656 > 10.243.0.1.5000: Flags [S], cksum 0x754c (correct), seq 1317075847, win 62727, options [mss 8961,sackOK,TS val 4009135325 ecr 0,nop,wscale 7], length 0
01:55:15.971876 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)                                                                                                                                                                                               
    10.243.0.1.5000 > 10.95.0.15.53656: Flags [S.], cksum 0x1590 (incorrect -> 0xd938), seq 276206397, ack 1317075848, win 62643, options [mss 8941,sackOK,TS val 1808698600 ecr 4009135325,nop,wscale 7], length 0
01:55:15.972184 IP (tos 0x0, ttl 64, id 15487, offset 0, flags [DF], proto TCP (6), length 52)

You'll notice above that the SYN packet has the MSS value set to what the interface allows via its MTU (8961), whereas the SYN-ACK packet has the clamped MSS value of 8941, which is the value that kube-router has defined in the mangle table.
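
For illustration only, an MSS-clamp rule in the mangle table has roughly this shape; the POSTROUTING chain and the hard-coded 8941 are placeholders, since kube-router programs its own chains and computes its own MSS value:

# Rewrite the MSS option on SYN packets; the TCPMSS target requires the
# SYN flag match shown below.
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 8941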

Additionally, looking at the veth interface towards my pod, I can see the following:

# tcpdump -vvvni any host 10.242.1.19                                                                                           
tcpdump: data link type LINUX_SLL2                                                                                                                                                                                                                                                       
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes                            
01:58:55.428818 kube-bridge Out IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)                 
    10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 52611, offset 0, flags [DF], proto TCP (6), length 60)                
    10.95.0.15.33544 > 10.243.0.1.5000: Flags [S], cksum 0xac83 (correct), seq 1574939217, win 62727, options [mss 8941,sackOK,TS val 4009354782 ecr 0,nop,wscale 7], length 0
01:58:55.428826 veth2c9fb91a Out IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)            
    10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 52611, offset 0, flags [DF], proto TCP (6), length 60)             
    10.95.0.15.33544 > 10.243.0.1.5000: Flags [S], cksum 0xac83 (correct), seq 1574939217, win 62727, options [mss 8941,sackOK,TS val 4009354782 ecr 0,nop,wscale 7], length 0

This shows that all of the IPIP-wrapped packets have the correct MSS set, while the outer layer of the packet stays intact.
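
As a side check (my own addition, not part of the capture above), the decapsulation point can be confirmed by listing IPIP tunnel devices inside the pod's network namespace; <pod-pid> is again a placeholder:

# With tunnel-mode DSR, kube-router is expected to set up a tunnel interface
# inside the pod's network namespace to terminate the IPIP traffic; this
# lists any such devices.
nsenter -n -t <pod-pid> ip -d link show type ipip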

Finally, using nsenter to enter the pod's network namespace and running tcpdump there shows the same:

# nsenter -n -t 228927 tcpdump -vvvni any host 10.242.1.19
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
02:08:41.036133 eth0  In  IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto IPIP (4), length 80)
    10.242.1.1 > 10.242.1.19: IP (tos 0x0, ttl 63, id 59701, offset 0, flags [DF], proto TCP (6), length 60)
    10.95.0.15.48488 > 10.243.0.1.5000: Flags [S], cksum 0x835f (correct), seq 1890883248, win 62727, options [mss 8941,sackOK,TS val 4009940389 ecr 0,nop,wscale 7], length 0

aauren • May 14 '24 02:05