
Bug in network policy ipsets when using dualStack

Open manuelbuil opened this issue 9 months ago • 2 comments

**What happened?**

We are using two ipSetHandlers, one per IP protocol:

  • nsc.ipSetHandlers[v1.IPv4Protocol]
  • nsc.ipSetHandlers[v1.IPv6Protocol]

Both ipSetHandlers track ipsets of both IP families. I don't know whether this is by design or whether this itself is the bug. Either way, as a consequence, when restoring the ipsets here, we write the same ipsets twice, and the data that persists is whatever ipSetHandlers[v1.IPv6Protocol] wrote, because it runs last.

This logic would be functionally fine if both ipSetHandlers carried the same information; however, they do not. After setting the verbosity level to 3 in order to see this log https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L569, I can observe that ipSetHandlers[v1.IPv6Protocol] uses wrong IPv4 addresses for the IPv4 ipsets.
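To make the failure mode concrete, here is a minimal sketch (hypothetical simplified types, not kube-router's real API) of what happens when both per-family handlers track ipsets of both families and each `ipset restore` rewrites every set it tracks: whichever handler runs last wins, so stale IPv4 data held by the IPv6 handler clobbers the correct entries.

```go
package main

import "fmt"

// ipSetHandler is a toy stand-in for one per-family handler.
type ipSetHandler struct {
	// set name -> member addresses this handler currently believes in
	sets map[string][]string
}

// restore mimics `ipset restore`: it unconditionally replaces the kernel's
// view of every set the handler tracks (via the create/swap/destroy dance).
func (h *ipSetHandler) restore(kernel map[string][]string) {
	for name, members := range h.sets {
		kernel[name] = members
	}
}

// simulate replays the sequence from the logs below and returns the kernel's
// final ipset state.
func simulate() map[string][]string {
	// The IPv4 handler learned the redeployed pod's new address 10.42.1.7 ...
	v4 := &ipSetHandler{sets: map[string][]string{
		"KUBE-DST-2FAIIK2E4RIPMTGF": {"10.42.1.7"},
	}}
	// ... but the IPv6 handler still carries the same IPv4 set with the old
	// address 10.42.1.6, alongside its own inet6 set.
	v6 := &ipSetHandler{sets: map[string][]string{
		"KUBE-DST-2FAIIK2E4RIPMTGF":       {"10.42.1.6"}, // stale
		"inet6:KUBE-DST-I3PRO5XXEERITJZO": {"2001:cafe:42:1::7"},
	}}

	kernel := map[string][]string{}
	v4.restore(kernel) // writes the correct 10.42.1.7
	v6.restore(kernel) // runs last and clobbers it with stale 10.42.1.6
	return kernel
}

func main() {
	fmt.Println(simulate()["KUBE-DST-2FAIIK2E4RIPMTGF"]) // [10.42.1.6]
}
```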

**What did you expect to happen?**

Ipsets should always be correct.

**How can we reproduce the behavior you experienced?**

  1. Deploy the client, server, service and policy yamls from https://github.com/jfmontanaro/k3s-netpol-issue-demo
  2. Verify that this works: kubectl exec -it pod/client -- wget -O - http://whoami.default.svc.cluster.local/
  3. Check the ipsets; both the IPv4 and IPv6 ipsets contain the server's correct IPs
  4. Remove the server
  5. Deploy the server again
  6. kubectl exec -it pod/client -- wget -O - http://whoami.default.svc.cluster.local no longer works
  7. Check the ipsets again; the IPv6 ipset has the correct IP, but the IPv4 ipset has either no IP or the old IP


**System Information:**

  • Kube-Router Version (kube-router --version): 2.1.1
  • Kubernetes Version (kubectl version): 1.29.3
  • Cloud Type: on premise
  • Kubernetes Deployment Type: via k3s

**Logs, other output, metrics**

May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: I0509 10:27:38.562250   11491 ipset.go:568] ipset restore looks like: create TMP-MOR3H7HU5JDLK6FI hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: add TMP-MOR3H7HU5JDLK6FI 10.42.1.7 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create KUBE-DST-2FAIIK2E4RIPMTGF hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: swap TMP-MOR3H7HU5JDLK6FI KUBE-DST-2FAIIK2E4RIPMTGF
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create TMP-NWERIXQHTHQDEF6A hash:ip family inet6 hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-NWERIXQHTHQDEF6A
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create inet6:KUBE-DST-I3PRO5XXEERITJZO hash:ip family inet6 hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: swap TMP-NWERIXQHTHQDEF6A inet6:KUBE-DST-I3PRO5XXEERITJZO
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-NWERIXQHTHQDEF6A
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: destroy TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: destroy TMP-NWERIXQHTHQDEF6A
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: I0509 10:27:38.595720   11491 policy.go:183] Restoring IPv4 ipset took 33.491478ms
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: I0509 10:27:38.595803   11491 ipset.go:568] MANU ipset restore looks like: create TMP-MOR3H7HU5JDLK6FI hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: add TMP-MOR3H7HU5JDLK6FI 10.42.1.6 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create KUBE-DST-2FAIIK2E4RIPMTGF hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: swap TMP-MOR3H7HU5JDLK6FI KUBE-DST-2FAIIK2E4RIPMTGF
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create TMP-NWERIXQHTHQDEF6A hash:ip family inet6 hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-NWERIXQHTHQDEF6A
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: add TMP-NWERIXQHTHQDEF6A 2001:cafe:42:1::7 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: create inet6:KUBE-DST-I3PRO5XXEERITJZO hash:ip family inet6 hashsize 1024 maxelem 65536 timeout 0
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: swap TMP-NWERIXQHTHQDEF6A inet6:KUBE-DST-I3PRO5XXEERITJZO
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: flush TMP-NWERIXQHTHQDEF6A
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: destroy TMP-MOR3H7HU5JDLK6FI
May 09 10:27:38 terraform-mbuil-vm0 k3s[11491]: destroy TMP-NWERIXQHTHQDEF6A

As you can see, the IPv6 handler uses the stale IPv4 address 10.42.1.6 for ipset KUBE-DST-2FAIIK2E4RIPMTGF.
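One plausible shape of a remedy (just a sketch under assumed simplified types, not necessarily how kube-router should implement it) is to filter each handler's sets by address family before building the restore payload, so the inet6 handler can never rewrite inet sets with stale IPv4 members:

```go
package main

import (
	"fmt"
	"net"
)

// isIPv6Set reports whether every member parses as an IPv6 address.
func isIPv6Set(members []string) bool {
	for _, m := range members {
		ip := net.ParseIP(m)
		if ip == nil || ip.To4() != nil {
			return false
		}
	}
	return true
}

// filterByFamily keeps only the sets whose members match the wanted family
// (wantV6 == true keeps IPv6 sets, false keeps IPv4 sets).
func filterByFamily(sets map[string][]string, wantV6 bool) map[string][]string {
	out := map[string][]string{}
	for name, members := range sets {
		if isIPv6Set(members) == wantV6 {
			out[name] = members
		}
	}
	return out
}

func main() {
	// The IPv6 handler's tracked sets, including a stale IPv4 set.
	v6Sets := map[string][]string{
		"KUBE-DST-2FAIIK2E4RIPMTGF":       {"10.42.1.6"}, // stale IPv4, must be skipped
		"inet6:KUBE-DST-I3PRO5XXEERITJZO": {"2001:cafe:42:1::7"},
	}
	for name := range filterByFamily(v6Sets, true) {
		fmt.Println("restore:", name) // only the inet6 set survives
	}
}
```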


manuelbuil avatar May 09 '24 16:05 manuelbuil

I submitted a fix for this in #1666

I tested it with the example that you gave and it worked correctly for me. However, I'd prefer another set of eyes and another system testing it to say that it works correctly. Let me know if you're able to test the fix on the PR.

aauren avatar May 10 '24 00:05 aauren

> I submitted a fix for this in #1666
>
> I tested it with the example that you gave and it worked correctly for me. However, I'd prefer another set of eyes and another system testing it to say that it works correctly. Let me know if you're able to test the fix on the PR.

awesome @aauren! Thanks for looking at this so fast :)

manuelbuil avatar May 10 '24 05:05 manuelbuil