
Node can't assign IPs to pods that don't have a specific security group attached

Open leofernandezg opened this issue 2 years ago • 9 comments

What happened:

My EKS node can't assign IPs, even though my subnets have IP addresses available. I'm using Security Groups for Pods in a private EKS cluster with EC2 nodes.

1- I'm using m5.4xlarge; according to the max-pods-calculator script, it should support 110 pods:

% ./max-pods-calculator.sh --instance-type m5.4xlarge --cni-version 1.11.2-eksbuild.1 --region sa-east-1

Output:

110
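For reference, this value can be reproduced with the documented prefix-delegation formula. A minimal sketch, assuming the m5.4xlarge limits from the EC2 documentation (8 ENIs, 30 IPv4 addresses per ENI) and the calculator's recommended cap of 110 pods for instances with fewer than 30 vCPUs:

```python
# Rough sketch of the max-pods calculation with prefix delegation enabled.
# Assumed instance limits for m5.4xlarge (EC2 docs): 8 ENIs, 30 IPv4 addrs/ENI.
ENIS = 8
IPS_PER_ENI = 30
PREFIX_SIZE = 16       # each delegated /28 prefix provides 16 addresses
RECOMMENDED_CAP = 110  # calculator's cap for instances with < 30 vCPUs

# One slot per ENI is the primary address; +2 accounts for host-network pods.
raw = ENIS * (IPS_PER_ENI - 1) * PREFIX_SIZE + 2
max_pods = min(raw, RECOMMENDED_CAP)
print(max_pods)  # 110, matching the script's output
```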

2- I have just 12 pods on that node:

% kubectl get pods -A -o wide | grep "ip-10-7-10-62.sa-east-1.compute.internal"

amazon-cloudwatch   cloudwatch-agent-sthxp                                            0/1     ContainerCreating   0               80m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
amazon-cloudwatch   fluent-bit-h4947                                                  0/1     ContainerCreating   0               81m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
api                 my-app-847b7644fd-8lkgr                                           7/7     Running             0               13m    10.7.10.163   ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
api                 my-other-app-7f6d46fc84-ljpp4                                     4/4     Running             1 (78m ago)     78m    10.7.10.63    ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
kube-system         autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7           0/1     ContainerCreating   0               63m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
kube-system         aws-node-kpchz                                                    1/1     Running             0               42m    10.7.10.62    ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
kube-system         kube-proxy-sf9mx                                                  1/1     Running             0               81m    10.7.10.62    ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
newrelic-system     newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6               0/1     ContainerCreating   0               70m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
newrelic-system     newrelic-bundle-newrelic-logging-tdgtj                            1/1     Running             0               81m    10.7.10.62    ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
newrelic-system     newrelic-bundle-nri-prometheus-6774f7d694-n7kz6                   0/1     ContainerCreating   0               63m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
newrelic-system     newrelic-bundle-nrk8s-ksm-bd8cdcdb5-qrg52                         0/2     ContainerCreating   0               76m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>
newrelic-system     newrelic-bundle-nrk8s-kubelet-27mft                               0/2     ContainerCreating   0               77m    <none>        ip-10-7-10-62.sa-east-1.compute.internal    <none>           <none>

3- The pods stuck in ContainerCreating show the classic "failed to assign an IP address" message; here is an example:

% kubectl describe pod autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7 -n kube-system

Output:

  Type     Reason                  Age                     From               Message
  ----     ------                  ----                    ----               -------
  Normal   Scheduled               53m                     default-scheduler  Successfully assigned kube-system/autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7 to ip-10-7-10-62.sa-east-1.compute.internal
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ff3f02cf5342377e66cebcf11686b2595985f2c24294960d82aa113858a16111" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "6acd46fb0725a49196af88bc1ecff1e3c3f9c51f6d014bc015e63c9d69d62cac" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "011a2eb25001641b9e7c2c929f56dfa3f8cf9331e37516792322ad671625ae7b" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "16063e5be4b96031104b0d3fd5ca9f2fdbc7d34dc57fb0e7d8f55c1198614d16" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "420d87ca269b53bd65246a4c576ab3c448ede9374aacbe52d6450fb0567d7b54" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b107a7d5cab66c9cf6dd4be01fc376f3265c65fe60832c70cad8ecdbbc72ccd0" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "76e867c2dbad6f3353c79dc7d3883429768d8a38502d22e05e6de7ca021adeb0" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4fef650a2a6330b6cbcf1008bec88fece0d773e25922a784f5935d53ac4f6e86" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  53m                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f1907df96a04a1df16dda539c710139ef9d3e7dcbbf4c61fe8ff4bf523609d73" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  8m58s (x2202 over 53m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e515712a77f1f19f77cb35f0404bd36f9ce9f894cb4974c13b2191ba1bc1688a" network for pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7": networkPlugin cni failed to set up pod "autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7_kube-system" network: add cmd: failed to assign an IP address to container
  Normal   SandboxChanged          3m58s (x2459 over 53m)  kubelet            Pod sandbox changed, it will be killed and re-created.

4- I checked the available IPs in the subnets, and they have plenty of free addresses:

(screenshot: subnet list showing available IP address counts)

5- This is the configuration of my aws-node DaemonSet:

% kubectl describe daemonset aws-node -n kube-system

Selector:       k8s-app=aws-node
Node-Selector:  <none>
Labels:         k8s-app=aws-node
Annotations:    deprecated.daemonset.template.generation: 8
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 0
Pods Status:  5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/name=aws-node
                    k8s-app=aws-node
  Service Account:  aws-node
  Init Containers:
   aws-vpc-cni-init:
    Image:      602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon-k8s-cni-init:v1.11.2-eksbuild.1
    Port:       <none>
    Host Port:  <none>
    Environment:
      DISABLE_TCP_EARLY_DEMUX:       true
      ENABLE_IPv6:                   false
      ENABLE_POD_ENI:                true
      AWS_VPC_K8S_CNI_EXTERNALSNAT:  true
      ENABLE_PREFIX_DELEGATION:      true
      WARM_ENI_TARGET:               4
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
  Containers:
   aws-node:
    Image:      602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon-k8s-cni:v1.11.2-eksbuild.1
    Port:       61678/TCP
    Host Port:  61678/TCP
    Requests:
      cpu:      25m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=60s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=1s timeout=10s period=10s #success=1 #failure=3
    Environment:
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:  false
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:       prng
      ENABLE_IPv4:                         true
      ENABLE_IPv6:                         false
      MY_NODE_NAME:                         (v1:spec.nodeName)
      ENABLE_PREFIX_DELEGATION:            true
      WARM_PREFIX_TARGET:                  1
      ENABLE_POD_ENI:                      true
      AWS_VPC_K8S_CNI_EXTERNALSNAT:        true
      WARM_ENI_TARGET:                     4
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log/aws-routed-eni from log-dir (rw)
      /run/xtables.lock from xtables-lock (rw)
      /var/run/aws-node from run-dir (rw)
      /var/run/dockershim.sock from dockershim (rw)
  Volumes:
   cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
   cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
   dockershim:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dockershim.sock
    HostPathType:  
   log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/aws-routed-eni
    HostPathType:  DirectoryOrCreate
   run-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/aws-node
    HostPathType:  DirectoryOrCreate
   xtables-lock:
    Type:               HostPath (bare host directory volume)
    Path:               /run/xtables.lock
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  29m   daemonset-controller  Created pod: aws-node-kpchz

6- All my nodes have vpc.amazonaws.com/has-trunk-attached: "true":

% kubectl get nodes -oyaml | grep 'vpc.amazonaws.com/has-trunk-attached'

Output:

      vpc.amazonaws.com/has-trunk-attached: "true"
      vpc.amazonaws.com/has-trunk-attached: "true"
      vpc.amazonaws.com/has-trunk-attached: "true"
      vpc.amazonaws.com/has-trunk-attached: "true"

7- Looking at the aws-node logs (/host/var/log/aws-routed-eni/ipamd.log) from that node, I can see that it can neither assign nor unassign IPs for those pods:

Output:

{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.194Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" Reason:\"PodDeleted\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" Reason:\"PodDeleted\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\""}

(Complete aws-node tail): % kubectl -n kube-system exec -it aws-node-kpchz -- tail -n70 /host/var/log/aws-routed-eni/ipamd.log

{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nrk8s-kubelet-27mft\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119\" Reason:\"PodDeleted\" ContainerID:\"89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.039Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/89377accf6c4938381f00ada05f02a5cd344e680cb739629557f0ecd6476f119/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.045Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.174Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23041/ns/net, Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.174Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23041/ns/net\""}
{"level":"info","ts":"2022-07-27T14:05:10.176Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23044/ns/net, Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.176Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23044/ns/net\""}
{"level":"info","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23042/ns/net, Sandbox 6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" ContainerID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23042/ns/net\""}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.180Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.180Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.182Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.182Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.185Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.185Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.194Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"fluent-bit-h4947\" K8S_POD_NAMESPACE:\"amazon-cloudwatch\" K8S_POD_INFRA_CONTAINER_ID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" Reason:\"PodDeleted\" ContainerID:\"d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.195Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/d694437635b062d8b39bac61000ced1224419386553c30eaca2ef3ec692dc3c5/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nri-prometheus-6774f7d694-n7kz6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" Reason:\"PodDeleted\" ContainerID:\"c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.196Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/c89c806f0bf392a4e8421cc13717ae9080b65b23251a85d4d7b4b3e941b33d76/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.199Z","caller":"rpc/rpc.pb.go:731","msg":"Received DelNetwork for Sandbox 6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d"}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"rpc/rpc.pb.go:731","msg":"DelNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-kube-state-metrics-6994bd5884-gr9k6\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_ID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" Reason:\"PodDeleted\" ContainerID:\"6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: IP address pool stats: total:0, assigned 0, sandbox aws-cni/6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d/eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2022-07-27T14:05:10.199Z","caller":"ipamd/rpc_handler.go:226","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/6e40637def4987b37fdeda1fa18a89cf845e2dd085222d9a7dc0522204423d0d/unknown"}
{"level":"info","ts":"2022-07-27T14:05:10.200Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.202Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.205Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"info","ts":"2022-07-27T14:05:10.212Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23444/ns/net, Sandbox 75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.212Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"autoscaler-aws-cluster-autoscaler-chart-7b6cddcdb-xqls7\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0\" ContainerID:\"75d50f16792e281f8fac24ea6f9e70de6fbd00c405d054841750b41820a72ed0\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23444/ns/net\""}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr , IPv6Addr: , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP/Prefix addresses"}
{"level":"info","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /proc/23478/ns/net, Sandbox f505fc51dda2fce9cfd4fa713a6ff4324b5d392558ae68768d905335f1d61c43, ifname eth0"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"newrelic-bundle-nrk8s-ksm-bd8cdcdb5-qrg52\" K8S_POD_NAMESPACE:\"newrelic-system\" K8S_POD_INFRA_CONTAINER_%
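Note that with ENABLE_PREFIX_DELEGATION=true, the ipamd datastore is filled with delegated /28 prefixes (16 addresses each) rather than individual secondary IPs, which is why the log says "no available IP/Prefix addresses" with total: 0. A quick illustration with Python's ipaddress module (the prefix value here is hypothetical, chosen to match the node's subnet):

```python
import ipaddress

# Hypothetical /28 prefix of the kind ipamd delegates to an ENI slot
prefix = ipaddress.ip_network("10.7.10.160/28")
print(prefix.num_addresses)         # 16 pod addresses per delegated prefix
print(prefix[0], "-", prefix[-1])   # range covered: 10.7.10.160 - 10.7.10.175
```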

8- Both ENIs that appear in the log are attached to the instance and use the cluster's security group:

(screenshot: ENIs eni-0efac8c4483ed5794 and eni-05d055aa998b5504d attached to the instance with the cluster's security group)

9- Could this be related to this PR?

Even 30 minutes after the node started, and after deleting the pods, it still can't assign IPs.

10- This happens frequently when I add a new node to the cluster.

My workaround is to drain the node, delete it, and try with a new one. Sometimes the new node attaches its ENIs correctly and assigns IPs without problems.

11- IMPORTANT: This issue does not happen for pods on the same node that have a custom security group assigned; it only affects pods in the cluster without a specific security group associated (which use the cluster's security group).

12- I'm nowhere near the ENI quota: my limit is 5,000 and I have 492:

(screenshot: ENI quota usage, 492 of 5,000)

Environment:

  • Kubernetes version: v1.22
  • Instance Type: m5.4xlarge
  • CNI Version: v1.11.2-eksbuild.1
  • OS: AL2_x86_64. ami: 1.22.9-20220629

leofernandezg — Jul 27 '22 14:07

@leofernandezg - It can't be related to the PR you pointed out, since custom networking is not enabled here and SGPP is enabled. For branch ENI pods, the VPC resource controller allocates the branch ENI/IP, not ipamd.

Can you please share the describe output of one of the pods that is stuck without an IP, along with the SecurityGroupPolicy?

jayanthvn — Jul 29 '22 05:07

Hi @jayanthvn, thank you for your response.

About the SecurityGroupPolicy: I'm not using one for this pod (the same applies to all the pods with IP-assignment problems).

I think this is where the problem lies: IPs are not being assigned correctly to pods for which I haven't defined a SecurityGroupPolicy. Pods without a SecurityGroupPolicy generally use the cluster's security group, and those are the pods that are failing, not the ones that use SecurityGroupPolicies.

Here is the describe output of one of the pods: % k describe pod newrelic-bundle-nrk8s-kubelet-8lncl -n newrelic-system

Name:           newrelic-bundle-nrk8s-kubelet-8lncl
Namespace:      newrelic-system
Priority:       0
Node:           ip-10-7-10-62.sa-east-1.compute.internal/10.7.10.62
Start Time:     Wed, 27 Jul 2022 12:18:09 -0300
Labels:         app.kubernetes.io/component=kubelet
                app.kubernetes.io/instance=newrelic-bundle
                app.kubernetes.io/name=newrelic-infrastructure
                controller-revision-hash=b8457cc9c
                mode=privileged
                pod-template-generation=1
Annotations:    checksum/agent-config: cb5361f959c74f8bb19670c0dd33e0ba91cf912332983f4eb9cee261d495ece7
                checksum/integrations_config: c489d627fdaa3302e22b3361354902714f70d14ccf8c602a5d88cd68911ab5ee
                checksum/license-secret: 7f313c347128143b80380de85541e2ef74e5d076f208d185eb50fb8ca5a58f9d
                checksum/nri-kubernetes: 56f74c5f2f350644edc46469ae0ac141ea47268e604af54c74e64e98eb07155b
                kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  DaemonSet/newrelic-bundle-nrk8s-kubelet
Containers:
  kubelet:
    Container ID:   
    Image:          newrelic/nri-kubernetes:3.4.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  300M
    Requests:
      cpu:     100m
      memory:  150M
    Environment:
      NRI_KUBERNETES_SINK_HTTP_PORT:                       8003
      NRI_KUBERNETES_CLUSTERNAME:                          cluster-prod
      NRI_KUBERNETES_VERBOSE:                              false
      NRI_KUBERNETES_NODENAME:                              (v1:spec.nodeName)
      NRI_KUBERNETES_NODEIP:                                (v1:status.hostIP)
      NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME:          cluster-prod
      NEW_RELIC_METADATA_KUBERNETES_NODE_NAME:              (v1:spec.nodeName)
      NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME:        newrelic-system (v1:metadata.namespace)
      NEW_RELIC_METADATA_KUBERNETES_POD_NAME:              newrelic-bundle-nrk8s-kubelet-8lncl (v1:metadata.name)
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME:        kubelet
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME:  newrelic/nri-kubernetes:3.4.0
    Mounts:
      /etc/newrelic-infra/nri-kubernetes.yml from nri-kubernetes-config (rw,path="nri-kubernetes.yml")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5f7s9 (ro)
  agent:
    Container ID:  
    Image:         newrelic/infrastructure-bundle:2.8.20
    Image ID:      
    Port:          8003/TCP
    Host Port:     0/TCP
    Args:
      newrelic-infra
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  300M
    Requests:
      cpu:     100m
      memory:  150M
    Environment:
      NRIA_LICENSE_KEY:                                    <set to the key 'licenseKey' in secret 'newrelic-bundle-newrelic-infrastructure-license'>  Optional: false
      NRIA_OVERRIDE_HOSTNAME_SHORT:                         (v1:spec.nodeName)
      NRIA_OVERRIDE_HOSTNAME:                               (v1:spec.nodeName)
      NRI_KUBERNETES_NODE_NAME:                             (v1:spec.nodeName)
      CLUSTER_NAME:                                        cluster-prod
      NRIA_PASSTHROUGH_ENVIRONMENT:                        CLUSTER_NAME
      NRIA_HOST:                                            (v1:status.hostIP)
      NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME:          cluster-prod
      NEW_RELIC_METADATA_KUBERNETES_NODE_NAME:              (v1:spec.nodeName)
      NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME:        newrelic-system (v1:metadata.namespace)
      NEW_RELIC_METADATA_KUBERNETES_POD_NAME:              newrelic-bundle-nrk8s-kubelet-8lncl (v1:metadata.name)
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME:        agent
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME:  newrelic/infrastructure-bundle:2.8.20
    Mounts:
      /dev from dev (rw)
      /etc/newrelic-infra.yml from config (rw,path="newrelic-infra.yml")
      /etc/newrelic-infra/integrations.d/ from nri-integrations-cfg-volume (rw)
      /host from host-volume (ro)
      /tmp from agent-tmpfs-tmp (rw)
      /var/db/newrelic-infra/data from agent-tmpfs-data (rw)
      /var/db/newrelic-infra/user_data from agent-tmpfs-user-data (rw)
      /var/log from log (rw)
      /var/run/docker.sock from host-docker-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5f7s9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  host-docker-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  
  log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  host-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  agent-tmpfs-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  agent-tmpfs-user-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  agent-tmpfs-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  nri-kubernetes-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      newrelic-bundle-nrk8s-kubelet
    Optional:  false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      newrelic-bundle-nrk8s-agent-kubelet
    Optional:  false
  nri-integrations-cfg-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      newrelic-bundle-nrk8s-integrations-cfg
    Optional:  false
  kube-api-access-5f7s9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason                  Age                      From     Message
  ----     ------                  ----                     ----     -------
  Normal   SandboxChanged          7m47s (x13072 over 45h)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  2m47s (x12646 over 45h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0bd594b81af9a5f10999a00952e58ab9bff64487a49ac59cd6d781fcb4a308af" network for pod "newrelic-bundle-nrk8s-kubelet-8lncl": networkPlugin cni failed to set up pod "newrelic-bundle-nrk8s-kubelet-8lncl_newrelic-system" network: add cmd: failed to assign an IP address to container

leofernandezg avatar Jul 29 '22 12:07 leofernandezg

@leofernandezg - sure, then these are not Branch ENI pods. Can you email the logs (sudo bash /opt/cni/bin/aws-cni-support.sh) to [email protected]?

Also, how many nodes do you have on this subnet? Is the subnet shared with another cluster that has prefix delegation and secondary IP mode?
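As a quick first check (the subnet ID below is a placeholder), the EC2 API reports how many free IPs a subnet has. Note that this raw count does not reveal fragmentation, i.e. whether a contiguous /28 can still be carved out of the free space:

```shell
# Hypothetical subnet ID; shows only the raw free-IP count,
# not whether an aligned /28 block is still available.
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[0].AvailableIpAddressCount' \
  --output text
```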

I need to check why the prefixes are not getting attached to the instance -

{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-0efac8c4483ed5794 does not have available addresses"}
{"level":"debug","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"AssignPodIPv4Address: ENI eni-05d055aa998b5504d does not have available addresses"}
{"level":"error","ts":"2022-07-27T14:05:10.218Z","caller":"datastore/data_store.go:713","msg":"DataStore has no available IP/Prefix addresses"}

jayanthvn avatar Jul 29 '22 15:07 jayanthvn

@jayanthvn , Email sent.

I have 4 nodes in that subnet (all with a similar number of pods). In that VPC and its subnets I have just this cluster.

leofernandezg avatar Jul 29 '22 16:07 leofernandezg

Thanks, will look into it.

jayanthvn avatar Jul 29 '22 18:07 jayanthvn

Looks like there is no space to carve a /28 prefix -

{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:2128","msg":"Prefix target is 1, short of 1 prefixes, free 0 prefixes"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:997","msg":"ToAllocate: 1"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1000","msg":"Found ENI eni-0efac8c4483ed5794 that has less than the maximum number of IP/Prefixes addresses allocated: cur=0, max=29"}
{"level":"info","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1003","msg":"Trying to allocate 1 IP addresses on ENI eni-0efac8c4483ed5794"}
{"level":"debug","ts":"2022-07-27T13:08:43.699Z","caller":"ipamd/ipamd.go:1003","msg":"PD enabled - true"}
{"level":"error","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1003","msg":"Failed to allocate a private IP/Prefix addresses on ENI eni-0efac8c4483ed5794: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: ba231e5b-399e-42c6-a946-946c4bc22233"}
{"level":"warn","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:910","msg":"failed to allocate all available IPv4 Prefixes on ENI eni-0efac8c4483ed5794, err: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: ba231e5b-399e-42c6-a946-946c4bc22233"}
{"level":"info","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1007","msg":"Trying to allocate 1 IP addresses on ENI eni-0efac8c4483ed5794"}
{"level":"debug","ts":"2022-07-27T13:08:44.014Z","caller":"ipamd/ipamd.go:1007","msg":"PD enabled - true"}
{"level":"error","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:1007","msg":"Failed to allocate a private IP/Prefix addresses on ENI eni-0efac8c4483ed5794: InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.\n\tstatus code: 400, request id: 0d2c3ed9-d87b-4337-9384-da04f65f28cb"}
{"level":"debug","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:580","msg":"Insufficient IP Addresses due to: InsufficientCidrBlocks\n"}
{"level":"error","ts":"2022-07-27T13:08:44.320Z","caller":"ipamd/ipamd.go:415","msg":"Unable to attach IPs/Prefixes for the ENI, subnet doesn't seem to have enough IPs/Prefixes. Consider using new subnet or carve a reserved range using create-subnet-cidr-reservation"}
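The last log line suggests `create-subnet-cidr-reservation`. As a hedged sketch (the subnet ID and CIDR below are placeholders, and the reserved range must be free and /28-aligned), a prefix reservation can be carved out with the AWS CLI so that prefix delegation keeps contiguous space for itself:

```shell
# Placeholders: pick a free, aligned range inside your subnet.
# --reservation-type prefix tells EC2 to use this range for ENI prefix assignment.
aws ec2 create-subnet-cidr-reservation \
  --subnet-id subnet-0123456789abcdef0 \
  --cidr 10.7.10.0/28 \
  --reservation-type prefix
```

In practice you would usually reserve a larger block (e.g. a /26) so it can hold several /28 prefixes.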

jayanthvn avatar Jul 29 '22 18:07 jayanthvn

Can you open a support case? We will involve the VPC team to check if the subnet is fragmented.

jayanthvn avatar Jul 29 '22 19:07 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Sep 28 '22 00:09 github-actions[bot]

Any updates on this? I have the same problem, but I don't enforce security groups for pods at all. Should I open a support case as well?

zekena2 avatar Oct 06 '22 07:10 zekena2

I hit this issue too. `kubectl describe` shows the error "kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: add cmd: failed to assign an IP address to container"

Tammyxia avatar Oct 25 '22 08:10 Tammyxia

@Tammyxia I don't know about your setup, but for me choosing prefix delegation was a bad idea because my subnet was small: nodes consumed all available prefixes really fast, and fragmentation was caused by the nodes' primary IP addresses.

zekena2 avatar Oct 25 '22 10:10 zekena2

@Tammyxia - Seems like you are hitting the IP limit. The IPAMD logs should call out the reason. We can check the logs too. Can you please email us ([email protected]) the log bundle?

jayanthvn avatar Oct 25 '22 14:10 jayanthvn

I tried recreating the pod and couldn't reproduce this issue again; now all pods are running. @jayanthvn Sorry, I don't know how to get the IPAMD log.
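For anyone else wondering how to get the ipamd log: the support-bundle script mentioned earlier in this thread collects it, and on an EKS node it can also be read directly at the VPC CNI's default log location (run on the node itself, not inside a pod):

```shell
# Default ipamd log location for the VPC CNI plugin on the node:
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log

# Or collect the full log bundle as suggested earlier in the thread:
sudo bash /opt/cni/bin/aws-cni-support.sh
```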

Tammyxia avatar Oct 27 '22 06:10 Tammyxia

We have the exact same issue; is there any update on this?

sharon-hunters avatar Nov 02 '22 20:11 sharon-hunters

Our issue was fixed by using bigger subnets.

The IPs were sufficient, but the /28 reservations were not. When you use security groups for pods, each security group policy creates a new ENI, and each ENI reserves a /28 CIDR (or the configured prefix).

So you can have available IPs, but still no free /28 blocks with which to create new ENIs.

Like I said, we fixed it by using bigger subnets.
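To illustrate the arithmetic (assuming a hypothetical /24 subnet, fully free and unfragmented), the number of /28 blocks a subnet can ever hold is small even when the raw IP count looks healthy:

```shell
#!/usr/bin/env bash
# A /28 holds 16 addresses, so a /24 holds at most 2^(28-24) = 16 such blocks.
# Each ENI backing a security-group policy consumes one of them, and
# fragmentation (e.g. scattered node primary IPs) reduces the real number further.
subnet_prefix=24      # hypothetical subnet size, e.g. 10.7.10.0/24
delegated_prefix=28   # the prefix size the CNI carves out per ENI
max_blocks=$(( 1 << (delegated_prefix - subnet_prefix) ))
echo "A /${subnet_prefix} subnet holds at most ${max_blocks} /${delegated_prefix} prefixes"
```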

leofernandezg avatar Nov 02 '22 23:11 leofernandezg

We had a similar issue, and we managed to overcome it by changing the aws-node DaemonSet configuration.

What we did was set the following env variables on the DaemonSet:

  1. WARM_IP_TARGET=10
  2. MINIMUM_IP_TARGET=10

This configuration makes the aws-node DaemonSet claim more IPs in advance on each node, which overcomes the delay in scheduling new pods. In our case the problem was specific to high-frequency jobs.
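For reference, a minimal way to apply these settings (assuming the DaemonSet is named aws-node in kube-system, which is the EKS default):

```shell
# Sets the warm-pool targets on the CNI DaemonSet; its pods restart to pick them up.
kubectl set env daemonset aws-node -n kube-system \
  WARM_IP_TARGET=10 MINIMUM_IP_TARGET=10
```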

Another problem we were facing is that our subnets weren't big enough to support this change across the entire cluster; our subnets are /20.

To overcome this, we isolated those jobs on a dedicated ASG and created another DaemonSet with the relevant configuration for that ASG only. We used taints and a node selector to ensure that the new DaemonSet is deployed only on that specific node group, and that the original DaemonSet is deployed on all node groups except this one. That solution let us make the change without starving the rest of the node groups of IPs; we increased the IP demand only on this specific node group.

We also operate the same infra in another region. That second cluster is bigger and runs many more workloads than the one we are talking about, yet we never came across this issue there. The only relevant difference I can think of between the clusters is the subnet size, which in the second cluster is /19.

That said, while we were investigating the issue we never saw the relevant subnets reach 0 available IP addresses.

I would also recommend setting ttlSecondsAfterFinished in your Job specs in order to release IP addresses faster for completed/failed pods. The default behavior is to keep those pods around for a long time for log inspection etc. Just bear in mind that as long as you can describe a pod with an IP address assigned to it, that address is not free and can't be reused elsewhere in the cluster.
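A sketch of that (the Job name is hypothetical; ttlSecondsAfterFinished is normally set in the Job spec at creation time, and it is one of the few Job spec fields that can also be patched onto an existing Job):

```shell
# Garbage-collect the Job and its finished pods 5 minutes after completion,
# which returns their IPs to the subnet sooner.
kubectl patch job my-batch-job \
  -p '{"spec":{"ttlSecondsAfterFinished":300}}'
```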

asafadar avatar Nov 08 '22 16:11 asafadar