kube-router
Martian packets in DSR mode
I have a test cluster set up: 1 master and 3 workers. Kube-router is running on all 4 nodes. I'm running 1 external IP for nginx (3 instances), with BGP amongst all the kube-routers and BGP up to an upstream router. So, the inbound packet flow is:
router -> one of the 4 nodes (IPVS) -> IPIP tunnel to one of the 3 nginx instances -> nginx
Inbound always works fine.
Outbound: nginx instance -> host -> router
Sometimes, and I don't know what causes this to engage, the host starts to drop the replies. I enabled martian logging, and the replies are hitting the martian case. I tried disabling rp_filter for all interfaces on the host (including all and default), and there are still martians.
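For context, the knobs involved are sysctls along these lines (a sketch only; per-interface keys such as net.ipv4.conf.eth0.rp_filter also exist and follow the same pattern):
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0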
IPVS table:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.116.3.1:443 rr
-> 10.116.20.150:6443 Masq 1 1 0
TCP 10.116.3.10:53 rr
-> 10.116.128.6:53 Masq 1 0 0
-> 10.116.128.7:53 Masq 1 0 0
TCP 10.116.3.178:443 rr
-> 10.116.130.38:8443 Masq 1 0 0
TCP 10.116.3.216:80 rr
-> 10.116.129.97:80 Masq 1 0 0
-> 10.116.130.41:80 Masq 1 0 0
-> 10.116.131.28:80 Masq 1 0 0
TCP 10.116.4.2:443 rr
-> 10.116.130.38:8443 Masq 1 1 0
TCP 10.116.20.152:30799 rr
-> 10.116.129.97:80 Masq 1 0 0
-> 10.116.130.41:80 Masq 1 0 0
-> 10.116.131.28:80 Masq 1 0 0
TCP 10.116.20.152:31278 rr
-> 10.116.130.38:8443 Masq 1 0 0
UDP 10.116.3.10:53 rr
-> 10.116.128.6:53 Masq 1 0 0
-> 10.116.128.7:53 Masq 1 0 0
FWM 8742 rr
-> 10.116.129.97:80 Tunnel 1 0 0
-> 10.116.130.41:80 Tunnel 1 0 0
-> 10.116.131.28:80 Tunnel 1 0 0
mangle table:
Chain PREROUTING (policy ACCEPT 2628 packets, 1176K bytes)
pkts bytes target prot opt in out source destination
159 7124 MARK tcp -- * * 0.0.0.0/0 10.116.4.1 tcp dpt:80 MARK set 0x2226
3 156 MARK tcp -- * * 0.0.0.0/0 10.116.4.2 tcp dpt:8443 MARK set 0x280
Chain INPUT (policy ACCEPT 2480 packets, 1159K bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 148 packets, 17027 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 2547 packets, 338K bytes)
pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 2695 packets, 355K bytes)
pkts bytes target prot opt in out source destination
tcpdump showing the issue:
# tcpdump -eni any host 10.116.4.1 or ip proto 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:49:35.829486 In 00:1e:be:a5:d0:00 ethertype IPv4 (0x0800), length 68: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0
00:49:35.829566 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:35.829572 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:35.829646 P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:35.829653 In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:36.826269 P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:36.826285 In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.826284 P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.826303 In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.830451 In 00:1e:be:a5:d0:00 ethertype IPv4 (0x0800), length 68: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0
00:49:38.830479 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:38.830482 Out 0a:58:0a:74:82:01 ethertype IPv4 (0x0800), length 88: 10.116.130.1 > 10.116.130.41: 10.104.5.122.51226 > 10.116.4.1.80: Flags [S], seq 832170241, win 64240, options [mss 1357,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
00:49:38.830507 P 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
00:49:38.830511 In 0a:58:0a:74:82:29 ethertype IPv4 (0x0800), length 68: 10.116.4.1.80 > 10.104.5.122.51226: Flags [S.], seq 1511523134, ack 832170242, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
dmesg showing martians:
[81632.897744] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00 .X.t...X.t.)..
[81635.233492] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81635.233514] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00 .X.t...X.t.)..
[81636.897461] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81636.897468] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00 .X.t...X.t.)..
[81638.897894] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81638.897899] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00 .X.t...X.t.)..
[81646.897322] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
[81646.897349] ll header: 00000000: 0a 58 0a 74 82 01 0a 58 0a 74 82 29 08 00 .X.t...X.t.)..
@thardie thanks for reporting the issue.
Dealing with martian packets has been the single biggest challenge in kube-router's DSR functionality. There are policy-based routing rules that kube-router adds to avoid martian packets. Likely they are missing, or kube-router failed to configure them in your setup.
If you still have the setup, or are able to reproduce this scenario, would you mind sharing the details below?
ip rule list
ip route list table 77
ip route list table 78
In your case, to avoid
[81635.233492] IPv4: martian source 10.104.5.122 from 10.116.4.1, on dev kube-bridge
I would expect a route in table 78, created by kube-router, to trick the kernel into believing 10.104.5.122
is reachable on `kube-bridge`.
I added the following 2 lines to each worker and master's /etc/sysctl.conf:
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
and rebooted them all. I have been unable to reproduce the martians since then. I'm reverting that change now to see if I can reproduce the martian issue.
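For reference, the same change can also be applied to a running node without a reboot, for example:
sysctl -p /etc/sysctl.conf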
@murali-reddy I just re-read your comment. The address 10.104.5.122 is the outside client IP (where the SYN came from, and where the SYN-ACK is going back to). My k8s addresses are all in 10.116.0.0/16, so I shouldn't expect to see client (outside) addresses in table 78, should I?
I'll continue to try and reproduce and get the ip rules and table output once reproduced again.
@thardie sorry, it should be 10.116.4.1 in routing tables 77 and 78.
I've been able to reproduce this issue. I checked tables 77 and 78. Table 77 is empty, and table 78 has:
local default dev lo scope host
I tried adding a route to table 77 (which looks like the table meant to handle reply traffic coming out of the containers), but it doesn't seem to help:
10.116.4.1 dev kube-bridge scope link
Adding it to table 78 seems wrong, since that table handles traffic coming in, and it would mess up the IP-in-IP encapsulation, right? In fact, I start to see ARPs for 10.116.4.1 on kube-bridge if I add the same route to table 78.
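For completeness, the inspection and the experiment described above correspond to commands along these lines (the route add is the experiment, not a suggested fix):
ip route list table 77
ip route list table 78
ip route add 10.116.4.1 dev kube-bridge table 77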
@thardie sorry, I might have given you the wrong table numbers earlier. You should see the tables below (name, ID); please see https://github.com/cloudnativelabs/kube-router/blob/v0.2.3/pkg/controllers/proxy/network_services_controller.go#L1728-L1731
customDSRRouteTableID = "78"
customDSRRouteTableName = "kube-router-dsr"
externalIPRouteTableId = "79"
externalIPRouteTableName = "external_ip"
The following combination of iptables mangle rules and policy-based routing achieves DSR.
For incoming traffic towards an external IP used by a service marked for DSR, the following rules apply (a concrete sketch with this thread's values follows the generic commands below):
- generate a unique fwmark number per service and fwmark the packets
- match traffic marked with that fwmark and use routing table 78
- the default rule in table 78 delivers the packet locally to the host
iptables -t mangle -A PREROUTING -d externalIP -p protocol --dport port -j MARK --set-mark generated-fwmark
ip rule add prio 32764 fwmark generated-fwmark table customDSRRouteTableID
ip route add local default dev lo table customDSRRouteTableID
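As a concrete sketch of the above, using values visible earlier in this thread (external IP 10.116.4.1, port 80, fwmark 0x2226, i.e. 8742 decimal, matching the FWM 8742 IPVS entry):
iptables -t mangle -A PREROUTING -d 10.116.4.1 -p tcp --dport 80 -j MARK --set-mark 0x2226
ip rule add prio 32764 fwmark 0x2226 table 78
ip route add local default dev lo table 78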
On the return path of packets from the pods, the rules below apply. The second rule in particular is what avoids the martian packets.
ip rule add prio 32765 from all lookup externalIPRouteTableId
ip route add externalIP dev kube-bridge table externalIPRouteTableId
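Sketched again with this thread's values and the table ID from the constants above (externalIPRouteTableId = 79):
ip rule add prio 32765 from all lookup 79
ip route add 10.116.4.1 dev kube-bridge table 79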
Please compare this description with your setup and see if anything is missing.
Hi @thardie, did you use a LoadBalancer to publish the service? I have the same issue when I use DSR mode with MetalLB in layer 2 mode.
LoadBalancer is not supported in the code; I added it and it tests OK. https://github.com/cloudnativelabs/kube-router/blob/4afd6d6d2ab9c94abc5985c30c56ca2605a70a3f/pkg/controllers/proxy/network_services_controller.go#L2198
Can we support LoadBalancer? Is there any risk? @murali-reddy
Closing as stale