amazon-vpc-cni-k8s
Node can't assign IPs to pods that don't have a specific security group attached
What happened:
My EKS node can't assign IPs to pods, even though I have IP addresses available in my subnets. I'm using Security Groups for Pods in a private EKS cluster with EC2 nodes.
1- I'm using m5.4xlarge, which according to the max-pods-calculator script should support 110 pods:
% ./max-pods-calculator.sh --instance-type m5.4xlarge --cni-version 1.11.2-eksbuild.1 --region sa-east-1
Output:
110
2- I have just 12 pods on that node:
% kubectl get pods -A -o wide | grep "ip-10-7-10-62.sa-east-1.compute.internal"
amazon-cloudwatch cloudwatch-agent-sthxp 0/1 ContainerCreating 0 80m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
amazon-cloudwatch fluent-bit-h4947 0/1 ContainerCreating 0 81m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
api my-app-847b7644fd-8lkgr 7/7 Running 0 13m 10.7.10.163 ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
api my-other-app-7f6d46fc84-ljpp4 4/4 Running 1 (78m ago) 78m 10.7.10.63 ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
kube-system autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7 0/1 ContainerCreating 0 63m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
kube-system aws-node-kpchz 1/1 Running 0 42m 10.7.10.62 ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
kube-system kube-proxy-sf9mx 1/1 Running 0 81m 10.7.10.62 ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
newrelic-system newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6 0/1 ContainerCreating 0 70m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
newrelic-system newrelic-bundle-newrelic-logging-tdgtj 1/1 Running 0 81m 10.7.10.62 ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
newrelic-system newrelic-bundle-nri-prometheus-6774f7d694-n7kz6 0/1 ContainerCreating 0 63m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
newrelic-system newrelic-bundle-nrk8s-ksm-bd8cdcdb5-qrg52 0/2 ContainerCreating 0 76m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
newrelic-system newrelic-bundle-nrk8s-kubelet-27mft 0/2 ContainerCreating 0 77m <none> ip-10-7-10-62.sa-east-1.compute.internal <none> <none>
3- I'm getting the classic "failed to assign an IP address" message on the pods stuck in ContainerCreating; here's an example:
% kubectl describe pod autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7 -n kube-system
Output:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 53m default-scheduler Successfully assigned kube-system/autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7 to ip-10-7-10-62.sa-east-1.compute.internal
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ff3f02cf5342377e66cebcf11686b2595985f2c24294960d82aa113858a16111" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "6acd46fb0725a49196af88bc1ecff1e3c3f9c51f6d014bc015e63c9d69d62cac" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "011a2eb25001641b9e7c2c929f56dfa3f8cf9331e37516792322ad671625ae7b" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "16063e5be4b96031104b0d3fd5ca9f2fdbc7d34dc57fb0e7d8f55c1198614d16" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "420d87ca269b53bd65246a4c576ab3c448ede9374aacbe52d6450fb0567d7b54" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b107a7d5cab66c9cf6dd4be01fc376f3265c65fe60832c70cad8ecdbbc72ccd0" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "76e867c2dbad6f3353c79dc7d3883429768d8a38502d22e05e6de7ca021adeb0" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4fef650a2a6330b6cbcf1008bec88fece0d773e25922a784f5935d53ac4f6e86" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 53m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f1907df96a04a1df16dda539c710139ef9d3e7dcbbf4c61fe8ff4bf523609d73" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 8m58s (x2202 over 53m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e515712a77f1f19f77cb35f0404bd36f9ce9f894cb4974c13b2191ba1bc1688a" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
Normal SandboxChanged 3m58s (x2459 over 53m) kubelet Pod sandbox changed, it will be killed and re-created.
4- I checked the available IPs in the subnets, and there are plenty:
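(For reference, this can also be checked from the CLI; the subnet ID below is a placeholder, and note that this count doesn't tell you whether the free addresses are contiguous:)
% aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
    --query 'Subnets[].{Subnet:SubnetId,FreeIPs:AvailableIpAddressCount}' --output table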

5- This is the configuration of my aws-node DaemonSet:
% kubectl describe daemonset aws-node -n kube-system
Selector: k8s-app=aws-node
Node-Selector: <none>
Labels: k8s-app=aws-node
Annotations: deprecated.daemonset.template.generation: 8
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 0
Pods Status: 5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/name=aws-node
k8s-app=aws-node
Service Account: aws-node
Init Containers:
aws-vpc-cni-init:
Image: 602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon-k8s-cni-init:v1.11.2-eksbuild.1
Port: <none>
Host Port: <none>
Environment:
DISABLE_TCP_EARLY_DEMUX: true
ENABLE_IPv6: false
ENABLE_POD_ENI: true
AWS_VPC_K8S_CNI_EXTERNALSNAT: true
ENABLE_PREFIX_DELEGATION: true
WARM_ENI_TARGET: 4
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
Containers:
aws-node:
Image: 602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon-k8s-cni:v1.11.2-eksbuild.1
Port: 61678/TCP
Host Port: 61678/TCP
Requests:
cpu: 25m
Liveness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=1s timeout=10s period=10s #success=1 #failure=3
Environment:
AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER: false
AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
ENABLE_IPv4: true
ENABLE_IPv6: false
MY_NODE_NAME: (v1:spec.nodeName)
ENABLE_PREFIX_DELEGATION: true
WARM_PREFIX_TARGET: 1
ENABLE_POD_ENI: true
AWS_VPC_K8S_CNI_EXTERNALSNAT: true
WARM_ENI_TARGET: 4
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/host/var/log/aws-routed-eni from log-dir (rw)
/run/xtables.lock from xtables-lock (rw)
/var/run/aws-node from run-dir (rw)
/var/run/dockershim.sock from dockershim (rw)
Volumes:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
dockershim:
Type: HostPath (bare host directory volume)
Path: /var/run/dockershim.sock
HostPathType:
log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/aws-routed-eni
HostPathType: DirectoryOrCreate
run-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/aws-node
HostPathType: DirectoryOrCreate
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType:
Priority Class Name: system-node-critical
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 29m daemonset-controller Created pod: aws-node-kpchz
6- All my nodes have the vpc.amazonaws.com/has-trunk-attached: "true" label:
% kubectl get nodes -oyaml | grep 'vpc.amazonaws.com/has-trunk-attached'
Output:
vpc.amazonaws.com/has-trunk-attached: "true"
vpc.amazonaws.com/has-trunk-attached: "true"
vpc.amazonaws.com/has-trunk-attached: "true"
vpc.amazonaws.com/has-trunk-attached: "true"
7- Looking at the aws-node logs (/host/var/log/aws-routed-eni/ipamd.log) from that node, I can see that it can neither assign nor unassign IPs for those pods:
Output:
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.194Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" Reason:\"PodDeleted\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" Reason:\"PodDeleted\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
(Complete aws-node tail):
% kubectl -n kube-system exec -it aws-node-kpchz -- tail -n70 /host/var/log/aws-routed-eni/ipamd.log
{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nrk8s-kubelet-27mft\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119\" Reason:\"PodDeleted\" ContainerID:\"89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.045Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.174Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23041/ns/net, Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.174Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23041/ns/net\""}
{"level":"info","ts":"2022-07-27T14:05:10.176Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23044/ns/net, Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.176Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23044/ns/net\""}
{"level":"info","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23042/ns/net, Sandbox 6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" ContainerID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23042/ns/net\""}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.182Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.185Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.194Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" Reason:\"PodDeleted\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" Reason:\"PodDeleted\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.199Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox 6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d"}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" Reason:\"PodDeleted\" ContainerID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.200Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.202Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.205Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.212Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23444/ns/net, Sandbox 75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.212Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0\" ContainerID:\"75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23444/ns/net\""}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23478/ns/net, Sandbox f505fc51dda2fce9cfd4fa713a6ff4324b5d392558ae68768d905335f1d61c43, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nrk8s-ksm-bd8cdcdb5-qrg52\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_%
8- I can see both ENIs that appear in the log attached to the instance, each with the security group of the cluster:

9- Could it be related to this PR? But even 30 minutes after the node started, and even after deleting the pods, it still can't assign IPs.
10- This happens frequently when I add a new node to the cluster.
My workaround is to drain the node, delete it, and try with a new one. Sometimes the new node attaches the ENIs correctly and assigns IPs without problems.
11- IMPORTANT: I noticed that this issue does not happen for pods on the same node that have a custom security group assigned, only for the other pods in the cluster that don't have a specific security group associated (those use the cluster's security group).
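(For context, pods get a custom security group through a SecurityGroupPolicy; a minimal sketch, where the name, selector, and group ID are placeholders:)
% cat <<'EOF' | kubectl apply -f -
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-app-sg-policy
  namespace: api
spec:
  podSelector:
    matchLabels:
      app: my-app
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
EOF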
12- I'm nowhere near the ENI quota: my limit is 5,000 and I have 492:
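(A rough way to compare current usage against the quota; the quota code below is, to my understanding, the one for "Network interfaces per Region", so treat it as an assumption:)
% aws ec2 describe-network-interfaces --query 'length(NetworkInterfaces)'   # current ENI count
% aws service-quotas get-service-quota --service-code vpc --quota-code L-DF5E4CA3 \
    --query 'Quota.Value'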

Environment:
- Kubernetes version:
v1.22
- Instance Type:
m5.4xlarge
- CNI Version:
v1.11.2-eksbuild.1
- OS:
AL2_x86_64, AMI: 1.22.9-20220629
@leofernandezg - It can't be related to the PR you pointed out, since custom networking is not enabled here and SGPP is enabled. For branch ENI pods, the VPC resource controller allocates the branch ENI/IP, not IPAMD.
Can you please share the describe output of one of the pods that is stuck without an IP, and the SecurityGroupPolicy?
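(Branch ENI pods can be told apart by the vpc.amazonaws.com/pod-eni annotation that the VPC resource controller adds; a quick check, using one of the pod names from above:)
% kubectl get pod my-app-847b7644fd-8lkgr -n api \
    -o jsonpath='{.metadata.annotations.vpc\.amazonaws\.com/pod-eni}'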
Hi @jayanthvn, thank you for your response.
About the security group policy: I'm not using one for this pod (the same is true for all the pods with IP assignment problems).
I think this is the problem: it's not correctly assigning an IP to the pods for which I haven't defined a SecurityGroupPolicy. Pods without a SecurityGroupPolicy normally use the cluster's security group, and those are the ones failing, not the ones that use SecurityGroupPolicies.
Here's the describe output for one of the pods:
% k describe pod newrelic-bundle-nrk8s-kubelet-8lncl -n newrelic-system
Name: newrelic-bundle-nrk8s-kubelet-8lncl
Namespace: newrelic-system
Priority: 0
Node: ip-10-7-10-62.sa-east-1.compute.internal/10.7.10.62
Start Time: Wed, 27 Jul 2022 12:18:09 -0300
Labels: app.kubernetes.io/component=kubelet
app.kubernetes.io/instance=newrelic-bundle
app.kubernetes.io/name=newrelic-infrastructure
controller-revision-hash=b8457cc9c
mode=privileged
pod-template-generation=1
Annotations: checksum/agent-config: cb5361f959c74f8bb19670c0dd33e0ba91cf912332983f4eb9cee261d495ece7
checksum/integrations_config: c489d627fdaa3302e22b3361354902714f70d14ccf8c602a5d88cd68911ab5ee
checksum/license-secret: 7f313c347128143b80380de85541e2ef74e5d076f208d185eb50fb8ca5a58f9d
checksum/nri-kubernetes: 56f74c5f2f350644edc46469ae0ac141ea47268e604af54c74e64e98eb07155b
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/newrelic-bundle-nrk8s-kubelet
Containers:
kubelet:
Container ID:
Image: newrelic/nri-kubernetes:3.4.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 300M
Requests:
cpu: 100m
memory: 150M
Environment:
NRI_KUBERNETES_SINK_HTTP_PORT: 8003
NRI_KUBERNETES_CLUSTERNAME: cluster-prod
NRI_KUBERNETES_VERBOSE: false
NRI_KUBERNETES_NODENAME: (v1:spec.nodeName)
NRI_KUBERNETES_NODEIP: (v1:status.hostIP)
NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME: cluster-prod
NEW_RELIC_METADATA_KUBERNETES_NODE_NAME: (v1:spec.nodeName)
NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME: newrelic-system (v1:metadata.namespace)
NEW_RELIC_METADATA_KUBERNETES_POD_NAME: newrelic-bundle-nrk8s-kubelet-8lncl (v1:metadata.name)
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME: kubelet
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME: newrelic/nri-kubernetes:3.4.0
Mounts:
/etc/newrelic-infra/nri-kubernetes.yml from nri-kubernetes-config (rw,path="nri-kubernetes.yml")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5f7s9 (ro)
agent:
Container ID:
Image: newrelic/infrastructure-bundle:2.8.20
Image ID:
Port: 8003/TCP
Host Port: 0/TCP
Args:
newrelic-infra
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 300M
Requests:
cpu: 100m
memory: 150M
Environment:
NRIA_LICENSE_KEY: <set to the key 'licenseKey' in secret 'newrelic-bundle-newrelic-infrastructure-license'> Optional: false
NRIA_OVERRIDE_HOSTNAME_SHORT: (v1:spec.nodeName)
NRIA_OVERRIDE_HOSTNAME: (v1:spec.nodeName)
NRI_KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CLUSTER_NAME: cluster-prod
NRIA_PASSTHROUGH_ENVIRONMENT: CLUSTER_NAME
NRIA_HOST: (v1:status.hostIP)
NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME: cluster-prod
NEW_RELIC_METADATA_KUBERNETES_NODE_NAME: (v1:spec.nodeName)
NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME: newrelic-system (v1:metadata.namespace)
NEW_RELIC_METADATA_KUBERNETES_POD_NAME: newrelic-bundle-nrk8s-kubelet-8lncl (v1:metadata.name)
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME: agent
NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME: newrelic/infrastructure-bundle:2.8.20
Mounts:
/dev from dev (rw)
/etc/newrelic-infra.yml from config (rw,path="newrelic-infra.yml")
/etc/newrelic-infra/integrations.d/ from nri-integrations-cfg-volume (rw)
/host from host-volume (ro)
/tmp from agent-tmpfs-tmp (rw)
/var/db/newrelic-infra/data from agent-tmpfs-data (rw)
/var/db/newrelic-infra/user_data from agent-tmpfs-user-data (rw)
/var/log from log (rw)
/var/run/docker.sock from host-docker-socket (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5f7s9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
host-docker-socket:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
host-volume:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
agent-tmpfs-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
agent-tmpfs-user-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
agent-tmpfs-tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
nri-kubernetes-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: newrelic-bundle-nrk8s-kubelet
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: newrelic-bundle-nrk8s-agent-kubelet
Optional: false
nri-integrations-cfg-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: newrelic-bundle-nrk8s-integrations-cfg
Optional: false
kube-api-access-5f7s9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 7m47s (x13072 over 45h) kubelet Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 2m47s (x12646 over 45h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0bd594b81af9a5f10999a00952e58ab9bff64487a49ac59cd6d781fcb4a308af" network for pod "newrelic-bundle-nrk8s-kubelet-8lncl": networkPlugin cni failed to set up pod "newrelic-bundle-nrk8s-kubelet-8lncl_newrelic-system" network: add cmd: failed to assign an IP address to container
@leofernandezg - Sure, then these are not branch ENI pods. Can you email the logs (sudo bash /opt/cni/bin/aws-cni-support.sh) to [email protected]?
Also, how many nodes do you have in this subnet? Is the subnet shared with a cluster that has prefix delegation and secondary IP mode?
I need to check why the prefixes are not getting attached to the instance -
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
@jayanthvn, email sent.
I have 4 nodes in that subnet (all with a similar number of pods). In that VPC and those subnets I have just this cluster.
Thanks, will look into it.
Looks like there is no space to carve a /28 prefix -
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:2128","msg":"Prefix target is 1, short of 1 prefixes, free 0 prefixes"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:997","msg":"ToAllocate: 1"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1000","msg":"Found ENI eni-0efac8c4483ed5794 that has less than the maximum number of IP/Prefixes addresses allocated: cur=0, max=29"}
{"level":"info","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1003","msg":"Trying to allocate 1 IP addresses on ENI eni-0efac8c4483ed5794"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1003","msg":"PD enabled - true"}
{"level":"error","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1003","msg":"Failed to allocate a private IP/Prefix addresses on ENI eni-0efac8c4483ed5794: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: ba231e5b-399e-42c6-a946-946c4bc22233"}
{"level":"warn","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:910","msg":"failed to allocate all available IPv4 Prefixes on ENI eni-0efac8c4483ed5794, err: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: ba231e5b-399e-42c6-a946-946c4bc22233"}
{"level":"info","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1007","msg":"Trying to allocate 1 IP addresses on ENI eni-0efac8c4483ed5794"}
{"level":"debug","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1007","msg":"PD enabled - true"}
{"level":"error","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:1007","msg":"Failed to allocate a private IP/Prefix addresses on ENI eni-0efac8c4483ed5794: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: 0d2c3ed9-d87b-4337-9384-da04f65f28cb"}
{"level":"debug","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:580","msg":"Insufficient IP Addresses due to: InsufficientCidrBlocks\n"}
{"level":"error","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:415","msg":"Unable to attach IPs/Prefixes for the ENI, subnet doesn't seem to have enough IPs/Prefixes. Consider using new subnet or carve a reserved range using create-subnet-cidr-reservation"}
Can you open a support case? We will involve the VPC team to check if the subnet is fragmented.
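(For reference, the reservation suggested in that last log line is created with the AWS CLI; the subnet ID and CIDR below are placeholders that would need sizing for your cluster:)
% aws ec2 create-subnet-cidr-reservation --subnet-id subnet-0123456789abcdef0 \
    --cidr 10.7.10.0/26 --reservation-type prefix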
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Any updates on this? I have the same problem, but I don't enforce Security Groups for Pods at all. Should I open a support case as well?
I hit this issue too; kubectl describe shows the error "kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: add cmd: failed to assign an IP address to container"
@Tammyxia I don't know about your setup, but for me choosing prefix delegation was a bad idea: my subnet was small, nodes consumed all available prefixes really fast, and fragmentation was caused by the nodes' primary IP addresses.
@Tammyxia - Seems like you are hitting the IP limit. The IPAMD logs should call out the reason. We can check the logs too; can you please email us ([email protected]) the logs bundle?
I tried recreating the pod and couldn't reproduce the issue again; now all pods are running. @jayanthvn Sorry, I don't know how to get the IPAMD log.
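(For anyone else wondering: the IPAMD log lives at /var/log/aws-routed-eni/ipamd.log on each node and can be read through the aws-node pod, as shown earlier in this thread; the pod name is a placeholder:)
% kubectl -n kube-system exec -it aws-node-xxxxx -- tail -n70 /host/var/log/aws-routed-eni/ipamd.log
The support script mentioned above (sudo bash /opt/cni/bin/aws-cni-support.sh, run on the node) also bundles these logs.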
We have the exact same issue; is there any update on this?
Our issue was fixed by using bigger subnets.
The IPs were sufficient, but the /28 reservations were not. When you use security groups for pods, each security group policy creates a new ENI, and with prefix delegation each ENI reserves a /28 CIDR (or the configured prefix size). A /28 must be a contiguous, aligned block of 16 addresses.
So you probably have available IPs scattered around the subnet, but no free /28 block left from which to carve new prefixes.
Like I said, we fixed it by using bigger subnets.
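(One way to see how many prefixes are already carved out of a subnet across all its ENIs; the subnet ID is a placeholder:)
% aws ec2 describe-network-interfaces --filters Name=subnet-id,Values=subnet-0123456789abcdef0 \
    --query 'NetworkInterfaces[].Ipv4Prefixes[].Ipv4Prefix'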
We had a similar issue; we managed to overcome it by changing the aws-node DaemonSet configuration.
What we did was set the following env variables on the daemon set:
- WARM_IP_TARGET=10
- MINIMUM_IP_TARGET=10
This configuration makes the aws-node ds claim more IPs in advance for each node and overcomes the delay in scheduling new pods - in our case it was specific to high-frequency jobs.
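(A minimal sketch of applying this; note that if the CNI is managed as an EKS add-on, manual edits like this may be overwritten by the add-on lifecycle, so whether this sticks depends on your setup:)
% kubectl -n kube-system set env daemonset/aws-node WARM_IP_TARGET=10 MINIMUM_IP_TARGET=10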
Another problem we faced is that our subnet wasn't big enough to support this change across the entire cluster - our subnet is a /20.
To overcome this we isolated those jobs onto a dedicated ASG and created another daemon set with the relevant configuration for that ASG only. We used taints and a node selector to ensure that the new ds is deployed only on that specific node group, and that the original ds is deployed on all node groups besides this one. That solution allowed us to make these changes without causing a lack of IPs for the rest of the node groups - we increased the IP demand only on this specific node group.
We also operate the same infra in another region. The second cluster is bigger and has many more workloads than the one we are talking about, and over there we never came across this issue. The only relevant difference I can think of between the clusters is the subnet size, which in the second cluster is /19.
That said, while we were investigating the issue we never saw the relevant subnets reach 0 available IP addresses.
I would also recommend setting ttlSecondsAfterFinished in the Job spec in order to release IP addresses faster for completed/failed jobs. The default behavior is to keep those pods around for a long time for log inspection etc. Just bear in mind that as long as you can describe a pod with an IP address assigned to it, that address is not free and can't be used again across the cluster.
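(A minimal Job sketch showing the field; the name, image, and TTL value are placeholders:)
% cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  ttlSecondsAfterFinished: 300   # delete the Job and its pods 5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]
EOF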