Avoid SNAT when a pod contacts another pod in host-network
The problem
When a pod pings a pod in host-network on another node, the received packet has the node IP as the source IP instead of the pod IP.
Let's look at an example:
I have a setup with 2 worker nodes and 3 pods:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1-85466549f4-k88wk 1/1 Running 0 20m 172.21.0.5 cheina-cluster1-worker <none> <none>
pod2-75697dd9c6-8bdkp 1/1 Running 0 20m 10.112.1.8 cheina-cluster1-worker <none> <none>
pod3-6c79b69577-cwvbp 1/1 Running 0 19m 10.112.2.6 cheina-cluster1-worker2 <none> <none>
IMPORTANT: pod1 is in host-network
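For context, "in host-network" means pod1's spec sets hostNetwork: true, so it shares the node's network namespace and gets the node IP (172.21.0.5). A minimal sketch of such a pod, with a purely illustrative name and image (this is not the exact manifest used above):
# Hypothetical sketch of a host-network pod like pod1; name, image and node are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  hostNetwork: true                  # share the node's network namespace: pod IP == node IP
  nodeName: cheina-cluster1-worker   # pin to the worker so it gets 172.21.0.5, as in the example
  containers:
  - name: main
    image: nicolaka/netshoot         # any image with ping/tcpdump would do
    command: ["sleep", "infinity"]
EOF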
When pod2 pings pod1 (ping -c1 172.21.0.5), pod1 receives this packet:
root@cheina-cluster1-worker:/# tcpdump -tnl -i any icmp
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
vethaac008fb In IP 10.112.1.8 > 172.21.0.5: ICMP echo request, id 42464, seq 1, length 64
vethaac008fb Out IP 172.21.0.5 > 10.112.1.8: ICMP echo reply, id 42464, seq 1, length 64
The received ICMP request has 10.112.1.8 as the source IP, which is pod2's IP.
If we repeat the same test with pod3 and pod1, the results are:
root@cheina-cluster1-worker:/# tcpdump -tnl -i any icmp
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
eth0 In IP 172.21.0.9 > 172.21.0.5: ICMP echo request, id 18956, seq 1, length 64
eth0 Out IP 172.21.0.5 > 172.21.0.9: ICMP echo reply, id 18956, seq 1, length 64
This time the source IP of the packet is not the pod's IP, but the IP of the node where the pod is scheduled:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cheina-cluster1-control-plane Ready control-plane 4h50m v1.29.0 172.21.0.4 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
cheina-cluster1-worker Ready <none> 4h49m v1.29.0 172.21.0.5 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
cheina-cluster1-worker2 Ready <none> 4h49m v1.29.0 172.21.0.9 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
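The masquerade can also be confirmed directly on the source node, assuming the conntrack CLI is available in the kind node image (it normally is, since kube-proxy depends on it):
# List source-NATed ICMP entries on the node hosting pod3; the flow from 10.112.2.6
# to 172.21.0.5 should show up rewritten to the node IP 172.21.0.9.
docker exec -it cheina-cluster1-worker2 conntrack -L --src-nat -p icmp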
Why this is a problem
If two pods (one of them in host-network) need to receive packets whose source IP matches the IP they use to contact the other pod, this masquerading is a problem.
In particular, I work on the open source project liqo, where we are developing a modular multi-cluster network solution. We decided to use geneve to create tunnels between the nodes (pods in host-network) and a gateway (a common pod used to reach a remote cluster).
Geneve works with all the major CNIs (cilium, calico, flannel), but not with kindnet. The cause is the problem I described above.
A possible solution
I think the cause of the problem is this iptables chain inside the kind node:
Chain POSTROUTING (policy ACCEPT 100 packets, 7756 bytes)
pkts bytes target prot opt in out source destination
90 7106 KUBE-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
0 0 DOCKER_POSTROUTING all -- * * 0.0.0.0/0 172.21.0.1
59 4690 KIND-MASQ-AGENT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type !LOCAL /* kind-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom KIND-MASQ-AGENT chain */
Chain KIND-MASQ-AGENT (1 references)
pkts bytes target prot opt in out source destination
3 252 RETURN all -- * * 0.0.0.0/0 10.112.0.0/16 /* kind-masq-agent: local traffic is not subject to MASQUERADE */
3 228 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */
Would it be possible to include a RETURN rule for packets whose destination is in the node CIDR, or would that cause some trouble?
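This would also be consistent with the two tests above: same-node traffic to 172.21.0.5 never reaches KIND-MASQ-AGENT because of the dst-type !LOCAL match in POSTROUTING, while cross-node traffic to a node IP misses the RETURN for 10.112.0.0/16 and falls through to MASQUERADE. A manual sketch of the proposed rule, purely for illustration (172.21.0.0/16 is the docker network of this cluster, and kindnetd may re-sync the chain and drop a hand-added rule):
# Illustrative only: skip masquerading for traffic whose destination is a node address.
docker exec -it cheina-cluster1-worker2 \
  iptables -t nat -I KIND-MASQ-AGENT 1 \
    -d 172.21.0.0/16 \
    -m comment --comment "kind-masq-agent: traffic to the node network is not subject to MASQUERADE" \
    -j RETURN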
Thanks for the great analysis; we discussed this in sig-network here: https://groups.google.com/g/kubernetes-sig-network/c/m6lwTjKLV8o/m/lnir_lqECwAJ
kindnet is very simple and is an internal detail of kind, so it keeps things simple by avoiding masquerade only for the pod subnets, which are the ones we are 100% sure we don't want to masquerade:
https://github.com/kubernetes-sigs/kind/blob/40c81f187425254daf2bf84360a6257a278252df/images/kindnetd/cmd/kindnetd/main.go#L126-L151C55
For the nodes' IPs, yes, it is perfectly fine not to masquerade at all, but why is this a problem for you? We don't expect anybody to build network solutions on top of kindnet; that is why we have the disableCNI option on the kind config API.
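For reference, disabling the default CNI is a one-line setting in the kind config (the field is networking.disableDefaultCNI in the v1alpha4 config API); a minimal sketch:
# Sketch: create a cluster without kindnet so another CNI (calico, cilium, ...) can be installed.
cat <<'EOF' > kind-no-cni.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF
kind create cluster --name cheina-cluster1 --config kind-no-cni.yaml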
Hi @aojea, thanks for your answer. Our solution is meant to run on top of the cluster's CNI.
Long story short, we create a geneve tunnel from each node to a pod called "gateway", which uses wireguard to connect to another cluster's "gateway". To do this, we run a daemonset in host-network on each node; it creates a geneve interface that uses the "gateway" IP as its remote endpoint. In the "gateway" we have a geneve interface connected to each node (each one uses the node's IP as its remote endpoint).
When traffic goes from the "gateway" to a node, the geneve interface on the node receives the encapsulated traffic with the IP of the node where the gateway is scheduled as the source IP, which is different from the IP configured as the remote endpoint on the node (a pod IP).
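Roughly, the per-node daemonset does something like the following (interface name, VNI and IPs are illustrative, reusing the addresses from the example above, with 10.112.2.6 standing in for the "gateway" pod):
# On a worker node: point a geneve device at the gateway pod's IP.
ip link add name liqo-gnv0 type geneve id 100 remote 10.112.2.6
ip link set liqo-gnv0 up
# With the MASQUERADE rule above, the encapsulated packets coming back from the gateway
# arrive with source 172.21.0.9 (the node hosting the gateway) instead of 10.112.2.6,
# so they no longer match the remote configured on this device.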
Yes, but kindnet is only used in kind; it is not something you are going to run your solution on top of. You are going to run on calico, cilium, ... and you can install those in kind too.
I know we can use other CNIs with kind, but using kindnet would be more convenient for development.
However, if you don't think this change would be useful to the community, we can adapt. Otherwise, if it is useful, we will take care of opening the PR with the changes.
It is important that kindnetd remains a very simple and lightweight default. We're only likely to consider the behavior a bug if it doesn't meet Kubernetes conformance requirements, and we're generally not taking feature requests here, because again it's intended to be extremely simple and lightweight but conformant.
There's an external forked copy (well, forked back to where it started: https://github.com/aojea/kindnet), but the OOTB default is not accepting non-critical features.
kindnetd is pretty 1:1 with what we've historically tested Kubernetes on GCE with, so I'm additionally hesitant to alter the behavior without proof that we're violating SIG Network's expectations.
Further: KIND is intended to a) help test Kubernetes itself and b) help users test Kubernetes applications. While a) takes priority, it doesn't seem this helps with a), and for b) it's detrimental to "help" users depend on non-conformant cluster expectations. To the extent possible, it should be true that if something works on kind it works on all conformant clusters.
There's an external forked copy (well, forked back to where it started: https://github.com/aojea/kindnet); we can accept patches there.