Failed to allocate pod IP even though there are plenty available
event:
14m Warning FailedCreatePodSandBox Pod/ingress-nginx-controller-644966f9d8-mmrdb Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "566461a584386191b1cc4dcc05193d30f7120f467ab2c694e2811eda1db34247": plugin type="flannel" failed (add): failed to allocate for range 0: no IP addresses available in range set: 192.168.63.161-192.168.63.190
pods:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-controller-644966f9d8-6l8cg 1/1 Running 0 16m 192.168.63.196 192.168.63.8 <none> <none>
ingress-nginx-controller-644966f9d8-glqzs 1/1 Running 0 16m 192.168.63.131 192.168.63.10 <none> <none>
ingress-nginx-controller-644966f9d8-mmrdb 0/1 ContainerCreating 0 16m <none> 192.168.63.9 <none> <none>
kube-flannel-ds-5tdbb 1/1 Running 0 32m 192.168.63.8 192.168.63.8 <none> <none>
kube-flannel-ds-wxzdt 1/1 Running 0 32m 192.168.63.10 192.168.63.10 <none> <none>
kube-flannel-ds-zc9m9 1/1 Running 0 32m 192.168.63.9 192.168.63.9 <none> <none>
related: #1416 https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-359911304
Can you check sudo ls /var/lib/cni/networks/cbr0? I suspect that directory is full of files with the different IPs. You can remove the ones that are not used anymore. Either containerd or flannel had problems when the container was removed and "forgot" to remove that file
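For reference, this is roughly what that host-local IPAM state directory looks like (the listing below is illustrative, not taken from this cluster, and the exact files depend on the host-local plugin version):
sudo ls /var/lib/cni/networks/cbr0
# 192.168.63.161  192.168.63.162  192.168.63.165  last_reserved_ip.0  lock
# each IP-named file records the ID of the container/sandbox that owns the address
sudo cat /var/lib/cni/networks/cbr0/192.168.63.161
# <container-or-sandbox-id>
# eth0
# compare the number of allocations against the 30 usable addresses of the range above
sudo ls /var/lib/cni/networks/cbr0 | grep -c '^192\.168\.'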
@manuelbuil Yeah, that's the issue, but I'm wondering if flannel should fix or mitigate this; it can easily be reproduced if you reboot the nodes.
I rebooted the node but I can't really reproduce it. Can you explain the steps you followed to reproduce it? Note that this is more likely to be a problem of the container runtime than of the CNI plugin. The container runtime should talk to the CNI plugin and tell it to remove everything for an endpoint; if that call does not happen, the CNI plugin can't know that the pod is gone.
@manuelbuil Okay, I think the issue happens when you only have some of the standard CNI plugins installed. In my case it was bridge, host-local and loopback; the kubelet complains that the portmap plugin is missing but keeps allocating new IPs under /var/lib/cni/networks/cbr0 for some reason. BTW, why does flannel need these CNI plugins?
That's how Flannel was built. It uses the cbr0 bridge to connect the different interfaces and the portmap plugin to do port mapping.
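For context, the conflist that kube-flannel typically writes to /etc/cni/net.d looks roughly like the following (recalled from the stock kube-flannel manifest, so exact contents may differ per install). The flannel plugin delegates to the bridge plugin, which in turn uses host-local IPAM; that is where /var/lib/cni/networks/cbr0 comes from. So bridge, host-local and portmap need to be in /opt/cni/bin alongside flannel (loopback is invoked by the container runtime itself for the lo interface).
cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}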
@manuelbuil Is there a list in the docs of which CNI plugins flannel needs to work?
Hi, would it make sense to create a DaemonSet which cleans up IP zombies?
IIRC, by design, a DEL might be lost, hence having a GC makes sense.
We also see this issue; reasoning below. Long-running tests hit this problem after a while (using a /24 per node). The job runs a lot of upgrades: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_cluster-network-addons-operator/2428/pull-e2e-cluster-network-addons-operator-lifecycle-k8s/1978786400497045504 It didn't happen with Calico, for example.
14m Warning FailedCreatePodSandBox pod/secondary-dns-5fc6686967-t97vf Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_secondary-dns-5fc6686967-t97vf_cluster-network-addons_9b165a5c-539f-41d4-9781-1f5ae2cb3311_0(5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402): error adding pod cluster-network-addons_secondary-dns-5fc6686967-t97vf to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402" Netns:"/var/run/netns/b30d651d-76fb-4e01-bcae-b162095eeba8" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-network-addons;K8S_POD_NAME=secondary-dns-5fc6686967-t97vf;K8S_POD_INFRA_CONTAINER_ID=5c39f1e51a02a48e2d99e0670c449379f2d45c9c631971d96f078b55931c3402;K8S_POD_UID=9b165a5c-539f-41d4-9781-1f5ae2cb3311" Path:"" ERRORED: error configuring pod [cluster-network-addons/secondary-dns-5fc6686967-t97vf] networking: [cluster-network-addons/secondary-dns-5fc6686967-t97vf/9b165a5c-539f-41d4-9781-1f5ae2cb3311:cbr0]: error adding container to network "cbr0": plugin type="flannel" failed (add): failed to allocate for range 0: no IP addresses available in range set: 10.244.0.1-10.244.0.254...
more info here https://github.com/kubevirt/kubevirtci/pull/1557#issuecomment-3448479120
TL;DR: as far as I understand, it is because multus is deleted as part of the upgrade we are doing, and then the DEL is lost, leaving IP zombies.
As mentioned in the microk8s issue (5267), this seems easy to reproduce under less-than-perfect conditions.
Alternatively, could IPs be reclaimed as needed when the pool is exhausted? Can dangling IPs be detected? Thanks.
To detect and remove zombies: note that this might be a little racy, as I saw some entries that weren't valid but got cleaned up after a few seconds. It's best to record the candidate dangling ones and remove them only if they still exist a few minutes after detection. Based on https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-2132809050
For CRI-O:
# walk the host-local allocation files (one per IP) and delete the ones
# whose owner ID is no longer known to the container runtime
cd /var/lib/cni/networks/cbr0
ls -1 | grep -P '([0-9]{1,3}\.){3}[0-9]{1,3}' | while read -r file; do
  hash=$(grep -Pom1 '\S+' "$file")   # container/sandbox ID recorded in the file
  short=${hash:0:13}                 # crictl prints IDs truncated to 13 characters
  crictl ps -a | awk '{print $9}' | grep -q "$short" || rm -f "$file"
done
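A rough two-pass variant of the same idea, to reduce the race mentioned above: record the suspects first and only delete the ones that are still dangling after a grace period. This is just a sketch; the cbr0 path, the 5-minute delay and matching the truncated ID anywhere in the crictl output are assumptions, not flannel behaviour.
#!/bin/bash
# Two-pass cleanup of zombie IP allocations left in the host-local IPAM dir.
DIR=/var/lib/cni/networks/cbr0
CANDIDATES=$(mktemp)

list_dangling() {
  ls -1 "$DIR" | grep -P '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | while read -r ip; do
    id=$(grep -Pom1 '\S+' "$DIR/$ip")                  # owner ID recorded in the file
    crictl ps -a | grep -q "${id:0:13}" || echo "$ip"  # unknown to the runtime -> suspect
  done
}

list_dangling > "$CANDIDATES"   # pass 1: note the suspects
sleep 300                       # grace period for in-flight ADD/DEL calls
list_dangling | grep -Fxf "$CANDIDATES" | while read -r ip; do
  rm -f "$DIR/$ip"              # still dangling after the delay: reclaim the address
done
rm -f "$CANDIDATES"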
If desired, we can try to contribute a GC that would do this.
To avoid the reboot problem, please see https://github.com/kubernetes/kubernetes/issues/57280#issuecomment-359911304. The problem is that the directory is non-volatile; make it volatile and that will fix it.
(I am just a flannel user who hit the same problem.)
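For what it's worth, one way to make it volatile is to put the state directory on tmpfs so stale allocations vanish with the reboot; this is only an illustration and assumes the default path and that losing IPAM state across reboots is acceptable (the containers are gone anyway):
# /etc/fstab: keep the host-local IPAM state in RAM so it is wiped on reboot
tmpfs  /var/lib/cni/networks  tmpfs  defaults,size=16m  0  0

# or clean up once by hand before the kubelet/runtime start creating new pods
sudo rm -rf /var/lib/cni/networks/cbr0/*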