Avoid SNAT when a pod contacts another pod in host-network
The problem
When a pod pings a pod in host-network on another node, the received packet has the node IP as the source IP instead of the pod IP.
Let's look at an example:
I have a setup with 2 worker nodes and 3 pods:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1-85466549f4-k88wk 1/1 Running 0 20m 172.21.0.5 cheina-cluster1-worker <none> <none>
pod2-75697dd9c6-8bdkp 1/1 Running 0 20m 10.112.1.8 cheina-cluster1-worker <none> <none>
pod3-6c79b69577-cwvbp 1/1 Running 0 19m 10.112.2.6 cheina-cluster1-worker2 <none> <none>
IMPORTANT: pod1 is in host-network
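For context, "in host-network" means pod1's spec sets hostNetwork: true, so it shares the node's network namespace and gets the node IP (172.21.0.5). A minimal sketch of such a pod, with a purely illustrative name and image (this is not the exact manifest used above):
# Hypothetical sketch of a host-network pod like pod1; name, image and node are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  hostNetwork: true                  # share the node's network namespace: pod IP == node IP
  nodeName: cheina-cluster1-worker   # pin to the worker so it gets 172.21.0.5, as in the example
  containers:
  - name: main
    image: nicolaka/netshoot         # any image with ping/tcpdump would do
    command: ["sleep", "infinity"]
EOF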
When pod2 pings pod1 (ping -c1 172.21.0.5), pod1 receives this packet:
root@cheina-cluster1-worker:/# tcpdump -tnl -i any icmp
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
vethaac008fb In IP 10.112.1.8 > 172.21.0.5: ICMP echo request, id 42464, seq 1, length 64
vethaac008fb Out IP 172.21.0.5 > 10.112.1.8: ICMP echo reply, id 42464, seq 1, length 64
The received ICMP request has 10.112.1.8 as the source IP, which is pod2's IP.
If we repeat the same test with pod3 and pod1, the results are:
root@cheina-cluster1-worker:/# tcpdump -tnl -i any icmp
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
eth0 In IP 172.21.0.9 > 172.21.0.5: ICMP echo request, id 18956, seq 1, length 64
eth0 Out IP 172.21.0.5 > 172.21.0.9: ICMP echo reply, id 18956, seq 1, length 64
This time the source IP of the packet is not the pod's IP, but the IP of the node where the pod is scheduled:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cheina-cluster1-control-plane Ready control-plane 4h50m v1.29.0 172.21.0.4 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
cheina-cluster1-worker Ready <none> 4h49m v1.29.0 172.21.0.5 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
cheina-cluster1-worker2 Ready <none> 4h49m v1.29.0 172.21.0.9 <none> Debian GNU/Linux 11 (bullseye) 5.15.0-88-generic containerd://1.7.1
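The masquerade can also be confirmed directly on the source node, assuming the conntrack CLI is available in the kind node image (it normally is, since kube-proxy depends on it):
# List source-NATed ICMP entries on the node hosting pod3; the flow from 10.112.2.6
# to 172.21.0.5 should show up rewritten to the node IP 172.21.0.9.
docker exec -it cheina-cluster1-worker2 conntrack -L --src-nat -p icmp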
Why this is a problem
If two pods (one of them in host-network) need to receive packets whose source IP matches the IP they use to contact the other pod, this masquerading is a problem.
In particular, I work on the open source project liqo, where we are developing a modular multi-cluster network solution. We decided to use geneve to create tunnels between the nodes (pods in host-network) and a gateway (a common pod used to reach a remote cluster).
Geneve works with all the major CNIs (cilium, calico, flannel), but not with kindnet. The cause is the problem I described above.
A possible solution
I think the cause of the problem is this iptables chain inside the kind node:
Chain POSTROUTING (policy ACCEPT 100 packets, 7756 bytes)
pkts bytes target prot opt in out source destination
90 7106 KUBE-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
0 0 DOCKER_POSTROUTING all -- * * 0.0.0.0/0 172.21.0.1
59 4690 KIND-MASQ-AGENT all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type !LOCAL /* kind-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom KIND-MASQ-AGENT chain */
Chain KIND-MASQ-AGENT (1 references)
pkts bytes target prot opt in out source destination
3 252 RETURN all -- * * 0.0.0.0/0 10.112.0.0/16 /* kind-masq-agent: local traffic is not subject to MASQUERADE */
3 228 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) */
Would it be possible to include a RETURN rule for packets whose destination is in the node CIDR, or would that cause some trouble?
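This would also be consistent with the two tests above: same-node traffic to 172.21.0.5 never reaches KIND-MASQ-AGENT because of the dst-type !LOCAL match in POSTROUTING, while cross-node traffic to a node IP misses the RETURN for 10.112.0.0/16 and falls through to MASQUERADE. A manual sketch of the proposed rule, purely for illustration (172.21.0.0/16 is the docker network of this cluster, and kindnetd may re-sync the chain and drop a hand-added rule):
# Illustrative only: skip masquerading for traffic whose destination is a node address.
docker exec -it cheina-cluster1-worker2 \
  iptables -t nat -I KIND-MASQ-AGENT 1 \
    -d 172.21.0.0/16 \
    -m comment --comment "kind-masq-agent: traffic to the node network is not subject to MASQUERADE" \
    -j RETURN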
Thanks for the great analysis; we discussed this in sig-network here: https://groups.google.com/g/kubernetes-sig-network/c/m6lwTjKLV8o/m/lnir_lqECwAJ
kindnet is very simple and is an internal detail of kind, so it keeps things simple by avoiding masquerade only for the pod subnets, which are the ones we are 100% sure we don't want to masquerade:
https://github.com/kubernetes-sigs/kind/blob/40c81f187425254daf2bf84360a6257a278252df/images/kindnetd/cmd/kindnetd/main.go#L126-L151C55
For the nodes' IPs, yes, it is perfectly fine not to masquerade at all, but why is this a problem for you? We don't expect anybody to build network solutions on top of kindnet; that is why we have the disableCNI option on the kind config API.
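For reference, disabling the default CNI is a one-line setting in the kind config (the field is networking.disableDefaultCNI in the v1alpha4 config API); a minimal sketch:
# Sketch: create a cluster without kindnet so another CNI (calico, cilium, ...) can be installed.
cat <<'EOF' > kind-no-cni.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF
kind create cluster --name cheina-cluster1 --config kind-no-cni.yaml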
Hi @aojea, thanks for your answer. Our solution is meant to run on top of the cluster's CNI.
Long story short, we create a geneve tunnel from each node to a pod called "gateway", which uses wireguard to connect to another cluster's "gateway". To do this, we run a daemonset in host-network on each node; it creates a geneve interface that uses the "gateway" IP as its remote endpoint. In the "gateway" we have a geneve interface connected to each node (each one uses the node's IP as its remote endpoint).
When traffic goes from the "gateway" to a node, the geneve interface on the node receives the encapsulated traffic with the IP of the node where the gateway is scheduled as the source IP, which is different from the IP configured as the remote endpoint on the node (a pod IP).
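Roughly, the per-node daemonset does something like the following (interface name, VNI and IPs are illustrative, reusing the addresses from the example above, with 10.112.2.6 standing in for the "gateway" pod):
# On a worker node: point a geneve device at the gateway pod's IP.
ip link add name liqo-gnv0 type geneve id 100 remote 10.112.2.6
ip link set liqo-gnv0 up
# With the MASQUERADE rule above, the encapsulated packets coming back from the gateway
# arrive with source 172.21.0.9 (the node hosting the gateway) instead of 10.112.2.6,
# so they no longer match the remote configured on this device.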
Yes, but kindnet is only used in kind; it is not something you are going to run your solution on top of. You are going to run on calico, cilium, ... and you can install those in kind too.
I know we can use other CNIs with kind, but using kindnet would be more convenient for development.
However, if you don't think this change would be useful to the community, we can adapt. Otherwise, if it is useful, we will take care of opening the PR with the changes.
It is important that kindnetd remains a very simple and lightweight default. We're only likely to consider the behavior a bug if it doesn't meet Kubernetes conformance requirements, and we're generally not taking feature requests here, because again it's intended to be extremely simple and lightweight but conformant.
There's an external forked copy (well, forked back to where it started: https://github.com/aojea/kindnet), but the OOTB default is not accepting non-critical features.
kindnetd is pretty 1:1 with what we've historically tested Kubernetes on GCE with, so I'm additionally hesitant to alter the behavior without proof that we're violating SIG Network's expectations.
Further: KIND is intended to a) help test Kubernetes itself and b) help users test Kubernetes applications. While a) takes priority, it doesn't seem this helps with a), and for b) it's detrimental to "help" users depend on non-conformant cluster expectations. To the extent possible, it should be true that if something works on kind it works on all conformant clusters.
There's an external forked copy (well, forked back to where it started: https://github.com/aojea/kindnet); we can accept patches there.