
Failed to allocate pod IP even though plenty are available

Open ttc0419 opened this issue 5 months ago • 10 comments

event:

14m                    Warning   FailedCreatePodSandBox   Pod/ingress-nginx-controller-644966f9d8-mmrdb   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "566461a584386191b1cc4dcc05193d30f7120f467ab2c694e2811eda1db34247": plugin type="flannel" failed (add): failed to allocate for range 0: no IP addresses available in range set: 192.168.63.161-192.168.63.190

pods:

NAME                                        READY   STATUS              RESTARTS   AGE   IP               NODE            NOMINATED NODE   READINESS GATES
ingress-nginx-controller-644966f9d8-6l8cg   1/1     Running             0          16m   192.168.63.196   192.168.63.8    <none>           <none>
ingress-nginx-controller-644966f9d8-glqzs   1/1     Running             0          16m   192.168.63.131   192.168.63.10   <none>           <none>
ingress-nginx-controller-644966f9d8-mmrdb   0/1     ContainerCreating   0          16m   <none>           192.168.63.9    <none>           <none>
kube-flannel-ds-5tdbb                       1/1     Running             0          32m   192.168.63.8     192.168.63.8    <none>           <none>
kube-flannel-ds-wxzdt                       1/1     Running             0          32m   192.168.63.10    192.168.63.10   <none>           <none>
kube-flannel-ds-zc9m9                       1/1     Running             0          32m   192.168.63.9     192.168.63.9    <none>           <none>

related: #1416 https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-359911304

ttc0419 avatar Jul 20 '25 09:07 ttc0419

Can you check sudo ls /var/lib/cni/networks/cbr0? I suspect that directory is full of files for different IPs. You can remove the ones that are no longer used. Either containerd or flannel had problems when the container was removed and "forgot" to remove that file
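For context, host-local IPAM keeps one file per allocated IP in that directory, and the file's first line is the ID of the container that owns the lease (an assumption about the default host-local layout; details can vary by version). A hermetic illustration using a temp directory instead of the real /var/lib/cni/networks/cbr0:

```shell
# Simulate the host-local data directory layout in a temp dir
dir=$(mktemp -d)
echo "566461a584386191b1cc4dcc0519" > "$dir/192.168.63.161"

ls "$dir"                         # one file per allocated IP, named after the IP
head -n1 "$dir/192.168.63.161"    # first line is the owning container ID
```

A lease whose container ID no longer matches any running container is a candidate for cleanup.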

manuelbuil avatar Jul 21 '25 09:07 manuelbuil

@manuelbuil Yeah, that's the issue, but I'm wondering if flannel should fix or mitigate it; it can easily be reproduced if you reboot the nodes

ttc0419 avatar Jul 21 '25 09:07 ttc0419

I rebooted the node but I can't really reproduce it. Can you explain the steps you followed to reproduce it? Note that this is more likely a problem of the container runtime than of the CNI plugin. The container runtime should talk to the CNI plugin and tell it to remove everything belonging to an endpoint; if this call does not happen, the CNI plugin can't know that the pod is gone

manuelbuil avatar Jul 21 '25 14:07 manuelbuil

@manuelbuil Okay, I think the issue happens when only some of the standard CNI plugins are installed. In my case it was bridge, host-local and loopback; the kubelet complains that the portmap plugin is missing, but keeps allocating new IPs under /var/lib/cni/networks/cbr0 for some reason. BTW, why does flannel need these CNI plugins?

ttc0419 avatar Jul 21 '25 16:07 ttc0419

That's how flannel was built. It uses the cbr0 bridge to connect the different interfaces and the portmap plugin to do port mapping
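For reference, flannel's stock CNI config (a sketch paraphrased from the conflist shipped in kube-flannel.yml; field values can differ across versions) shows why those binaries are needed: the flannel plugin delegates to the bridge plugin, which uses host-local for IPAM, and portmap is chained for hostPort support, so all of them must exist under /opt/cni/bin.

```json
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
```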

manuelbuil avatar Jul 22 '25 09:07 manuelbuil

@manuelbuil Is there a list in the docs of which CNI plugins flannel needs to work?

ttc0419 avatar Jul 22 '25 13:07 ttc0419

Hi, would it make sense to create a DaemonSet that cleans up IP zombies?

IIRC, by design, a CNI DEL might be lost, hence having a GC makes sense

We also see this issue; reasoning below. Long test runs hit this problem after a while (using a /24 per node). The job runs lots of upgrades: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_cluster-network-addons-operator/2428/pull-e2e-cluster-network-addons-operator-lifecycle-k8s/1978786400497045504 It didn't happen with Calico, for example

14m         Warning   FailedCreatePodSandBox   pod/secondary-dns-5fc6686967-t97vf   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_secondary-dns-5fc6686967-t97vf_cluster-network-addons_9b165a5c-539f-41d4-9781-1f5ae2cb3311_0(5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402): error adding pod cluster-network-addons_secondary-dns-5fc6686967-t97vf to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402" Netns:"/var/run/netns/b30d651d-76fb-4e01-bcae-b162095eeba8" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-network-addons;K8S_POD_NAME=secondary-dns-5fc6686967-t97vf;K8S_POD_INFRA_CONTAINER_ID=5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402;K8S_POD_UID=9b165a5c-539f-41d4-9781-1f5ae2cb3311" Path:"" ERRORED: error configuring pod [cluster-network-addons/secondary-dns-5fc6686967-t97vf] networking: [cluster-network-addons/secondary-dns-5fc6686967-t97vf/9b165a5c-539f-41d4-9781-1f5ae2cb3311:cbr0]: error adding container to network "cbr0": plugin type="flannel" failed (add): failed to allocate for range 0: no IP addresses available in range set: 10.244.0.1-10.244.0.254...

oshoval avatar Oct 20 '25 15:10 oshoval

more info here https://github.com/kubevirt/kubevirtci/pull/1557#issuecomment-3448479120

TL;DR: as far as I understand, it is because multus is deleted as part of the upgrade we are doing, and then the DEL is lost, leaving IP zombies

oshoval avatar Oct 26 '25 12:10 oshoval

As mentioned in the microk8s issue (5267), this seems easy to reproduce in less-than-perfect conditions.

Alternatively, could IPs be reclaimed as needed when the pool is exhausted? Can dangling IPs be detected? Thanks.

d-shehu avatar Oct 27 '25 16:10 d-shehu

Note that detecting and removing zombies might be a little racy: I saw some entries that weren't valid but got cleaned up after a few seconds, so it's best to record the candidate dangling ones and remove them only if they still exist a few minutes after detection. Based on https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-2132809050

For CRI-O

cd /var/lib/cni/networks/cbr0
ls -1 | grep -P '([0-9]{1,3}\.){3}[0-9]{1,3}' | while read -r file; do
  # first token in each lease file is the owning container ID
  hash=$(grep -Pom1 '\S+' "$file")
  short=${hash:0:13}
  # remove the lease if no container known to the runtime matches the short ID
  crictl ps -a | awk '{print $9}' | grep -q "$short" || rm -f "$file"
done

If desired, we can try to contribute a GC that would do it.
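The two-pass detection described above (record candidates, then delete only those still dangling later) can be sketched hermetically; in real use the lease files would come from /var/lib/cni/networks/cbr0 and the live container IDs from `crictl ps -a`, as in the script above:

```shell
# Two-pass GC sketch with stand-in data (assumption: host-local layout,
# first line of each lease file is the owning container ID)
gcdir=$(mktemp -d)
cand=$(mktemp)

echo "cid-gone" > "$gcdir/10.244.0.5"   # lease whose container is gone
echo "cid-live" > "$gcdir/10.244.0.6"   # lease whose container still runs
live_ids="cid-live"                     # stand-in for the runtime's container list

# Pass 1: record candidates whose container ID is not in the live set
for f in "$gcdir"/*; do
  cid=$(head -n1 "$f")
  echo "$live_ids" | grep -q "$cid" || echo "$f" >> "$cand"
done

# (a real GC would sleep a few minutes here to avoid racing in-flight ADDs)

# Pass 2: delete only candidates that still exist and are still dangling
while read -r f; do
  [ -f "$f" ] || continue
  cid=$(head -n1 "$f")
  echo "$live_ids" | grep -q "$cid" || rm -f "$f"
done < "$cand"
```

Only leases that were dangling on both passes are reclaimed; anything that became valid in between is left alone.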

To avoid the reboot problem, see please https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-359911304 The problem is that the directory is non-volatile; making it volatile fixes it
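One concrete way to make it volatile (an example setup, not an official flannel recommendation) is to mount a tmpfs over the host-local data directory, so stale leases are discarded on every reboot:

```
# /etc/fstab entry (or an equivalent systemd mount unit);
# leases under /var/lib/cni/networks then don't survive a reboot
tmpfs  /var/lib/cni/networks  tmpfs  defaults,size=16m  0  0
```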

(I am just a flannel user who had the same problem)

oshoval avatar Oct 27 '25 16:10 oshoval