
CNI does not wait for ENI and IP to be available before reporting ready when setting MINIMUM_IP_TARGET and WARM_IP_TARGET

Open dtnyn opened this issue 4 years ago • 14 comments

What happened:

After setting MINIMUM_IP_TARGET and WARM_IP_TARGET, pods scheduled on a new node are stuck on "failed to assign an IP address to container" for a few minutes before they can be created.

From my observation, after setting these overrides, aws-node does not wait for its datastore to be populated before reporting ready. Pods are therefore scheduled onto these nodes and go into a crashing state because no IP can be assigned.
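
A quick way to see this window is to compare aws-node's reported readiness with what ipamd actually has in its datastore. The commands below are only a sketch; the k8s-app=aws-node label and the local introspection endpoint on port 61679 are the usual defaults and may not apply to every install.

# From a workstation: is aws-node already 1/1 Ready on the new node?
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# On the node itself: what does the ipamd datastore actually contain?
# (default local introspection endpoint; may be disabled or on another port)
curl -s http://localhost:61679/v1/enis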

Pod log

Normal   SuccessfulAttachVolume  3m16s              attachdetach-controller  AttachVolume.Attach succeeded for volume "XXXXXXXX"
  Warning  FailedCreatePodSandBox  3m12s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "22cbd7067d76c2ed509436f65384e1df59ec8949e7910d7bbbfb32de8a511f89": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  3m                 kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "81a9fde8b5ea0da28959e8c08e08dc2ffcbe700c7f3067afd273e9ac67b37588": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m48s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a94f47242b45d3610a9bf25ee17de09d8fb6f77dcf1754deac13395020b1082f": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m37s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a6f2f67496e2c75cdc89963afc5d4f0b47fe3d6988d0356b4cd0e0ffb641400c": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m24s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de2c20851923cc290e187ade06472c2f77595c5d55f7ee7522603fb55e9ee943": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m10s              kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4f3e15dbff1e9f6e2ad7055a61b75724d9ccd60850e05fe811966ca1cae65529": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  115s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a545894e2066ed715d5b3667c38bf9f6ecefd45c67e79da2e8016929e84c66a0": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  100s               kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9e76dd6f08280e3f246e1f366b8b0170abc345e814754797cea56e523e2f14c1": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  88s                kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5c6a3409855d54fd5f296e0793eafc84202a4e825a2cea86784b54bfb3220231": add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  75s                kubelet                  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0be78bc1e43e44a0f14a503c9cfbb0fdd9570d5aab0a6db9244e129c3ef294a9": add cmd: failed to assign an IP address to container
  Normal   Pulling                 62s                kubelet                  Pulling image "XXXXXXXXXXX"

I have been monitoring the ready state of the pod and can see that it is declared ready before ipamd has successfully added the ENI and its IPs to the datastore.
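
Something along these lines can be used to capture the status change shown below; this is only a hypothetical sketch, not the exact commands used.

# Hypothetical capture loop: poll the aws-node pod once a second
while true; do
  date
  kubectl -n kube-system get pod aws-node-xnpcf -o wide
  sleep 1
done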

The pod status change

Thu Nov  4 11:43:31 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     69s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:32 AEDT 2021
aws-node-xnpcf                                                          0/1   Running             0     70s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:33 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     72s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>
Thu Nov  4 11:43:35 AEDT 2021
aws-node-xnpcf                                                          1/1   Running             0     73s   10.1.56.83    ip-10-1-56-83.ap-southeast-2.compute.internal    <none>   <none>

Readiness

date; k describe po aws-node-xnpcf | ag Ready:
Thu Nov  4 11:43:43 AEDT 2021
    Ready:          True
    Ready:          True

The node status change

ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   38s   v1.20.11
Thu Nov  4 11:43:01 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   NotReady   <none>   39s   v1.20.11
Thu Nov  4 11:43:02 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   41s   v1.20.11
Thu Nov  4 11:43:03 AEDT 2021
ip-10-1-56-83.ap-southeast-2.compute.internal   Ready   <none>   42s   v1.20.11

Inside ipamd, around the time the pod and node are declared ready, I can see pods being scheduled and requesting an IP, but there is nothing in the datastore yet:

{"level":"info","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea, Sandbox b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901, ifname eth0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" ContainerID:\"b624815ee518e511c3c99bdc8e66dfd1c996fe32986fadf90d9bd5235594e901\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-42c1a3c7-8c69-e278-754d-efb63b6b33ea\" "}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 0, assigned 0"}
{"level":"debug","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignPodIPv4Address: ENI eni-0f7e7018aaead6bc1 does not have available addresses"}
{"level":"error","ts":"2021-11-04T00:43:35.680Z","caller":"ipamd/rpc_handler.go:141","msg":"DataStore has no available IP/Prefix addresses"}

It is not until a while later that ipamd successfully populates the datastore:

{"level":"info","ts":"2021-11-04T00:45:49.425Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-06129a67d2970e562)'s IP/Prefix 10.7.31.56/32 to datastore"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, totalprefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T00:45:49.426Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

During this window between aws-node reporting ready and the ipamd datastore actually being populated with IPs, pods are scheduled onto the node and fail, which is undesirable.

What you expected to happen:

I expected behaviour similar to the default WARM_ENI_TARGET=1 case, where the pod is not considered ready until the datastore has been populated with the attached ENI's IPs.

The following logs are from using the default configuration with no MINIMUM_IP_TARGET and WARM_IP_TARGET overrides. In this case aws-cni works as expected.

Pod status

Thu Nov  4 15:55:28 AEDT 2021
aws-node-7xwp2                                                          0/1   Running             0     69s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:29 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     70s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>
Thu Nov  4 15:55:31 AEDT 2021
aws-node-7xwp2                                                          1/1   Running             0     72s     10.1.56.123   ip-10-1-56-123.ap-southeast-2.compute.internal   <none>   <none>

Ready status

Thu Nov  4 15:55:26 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:27 AEDT 2021
    Ready:          True
    Ready:          False
Thu Nov  4 15:55:29 AEDT 2021
    Ready:          True
    Ready:          True

I can see in the ipamd log that the datastore successfully populates itself with the ENI's IPs around this time:

{"level":"info","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:917","msg":"Added ENI(eni-035e6725b4d13e122)'s IP/Prefix 10.7.30.221/32 to datastore"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:904","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:905","msg":"Datastore Pool stats: total(/32): 14, assigned(/32): 0, total prefixes(/28): 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-04T04:55:02.623Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}

Environment:

  • Kubernetes version (use kubectl version): v1.20.11
  • CNI Version: v1.9.0
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): Linux xxxxxxxx 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

dtnyn avatar Nov 04 '21 22:11 dtnyn

@dtnyn - What are your WARM_IP_TARGET and MINIMUM_IP_TARGET values? Do you have custom networking enabled?

With respect to WARM_IP_TARGET/MINIMUM_IP_TARGET and WARM_ENI_TARGET, there is no difference in how IPAMD starts up. We do wait for ipamd to start - https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.9/scripts/entrypoint.sh#L139 - before copying the CNI config file - https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.9/scripts/entrypoint.sh#L145 - and in ipamd init we do try to assign IPs to the primary ENI - https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.9.0/pkg/ipamd/ipamd.go#L493. So before the node is ready, we should typically have the primary ENI populated with IPs in non-custom-networking mode.
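
Roughly, that readiness gate looks like the following; this is a simplified sketch of the flow described above, not the actual entrypoint.sh (the probe address, file names, and paths are the usual defaults).

# Simplified sketch: block until ipamd answers its gRPC health check, then
# install the CNI config so kubelet can mark the node Ready.
until /app/grpc-health-probe -addr=127.0.0.1:50051 >/dev/null 2>&1; do
  sleep 1
done
cp /app/10-aws.conflist /host/etc/cni/net.d/10-aws.conflist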

Once you share the configs, we can try to verify this scenario locally.

jayanthvn avatar Nov 05 '21 02:11 jayanthvn

This is the configuration enabled for the CNI:

Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s
Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s
Environment:
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:  true
  AWS_VPC_K8S_CNI_EXTERNALSNAT:        true
  ENI_CONFIG_LABEL_DEF:                awscni
  AWS_VPC_K8S_CNI_VETHPREFIX:          eni
  AWS_VPC_K8S_CNI_LOGLEVEL:            DEBUG
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL:        DEBUG
  WARM_IP_TARGET:                      5
  MINIMUM_IP_TARGET:                   25
  MY_NODE_NAME:                         (v1:spec.nodeName)
  WATCH_NAMESPACE:                     xxxxxx (v1:metadata.namespace)

This configuration is used on both m5.xlarge and m5a.xlarge instances, and the difference in behaviour between the default WARM_ENI_TARGET and the WARM_IP_TARGET override can be observed on both.
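
For anyone reproducing this, such overrides are typically applied to the aws-node daemonset with something like the command below; the daemonset name and namespace are the EKS defaults.

# Apply the WARM/MINIMUM IP overrides to the CNI daemonset
kubectl -n kube-system set env daemonset aws-node WARM_IP_TARGET=5 MINIMUM_IP_TARGET=25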

dtnyn avatar Nov 05 '21 02:11 dtnyn

Thanks for sharing the details. Since you mentioned this is a new node, the instance will have only the primary ENI attached, and during aws-node initialization IPs won't be assigned to the primary ENI since custom networking is enabled. The secondary ENI will be allocated by the reconciler based on the WARM target requirements, since the number of available IPs will be 0. Before the reconciler thread starts, the node can be marked as ready while the datastore has no IPs. But we will try both configs locally and get back to you.
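
For reference, the custom networking subnet that the new secondary ENI is created in comes from the ENIConfig selected via ENI_CONFIG_LABEL_DEF (awscni in this report). Assuming the standard CRD is installed, those resources can be inspected with:

# List the ENIConfig custom-networking resources and the node label used to select them
kubectl get eniconfigs.crd.k8s.amazonaws.com -o yaml
kubectl get nodes -L awscni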

jayanthvn avatar Nov 05 '21 05:11 jayanthvn

Maybe a better description of the problem is the difference between the init time and the time when the datastore is first populated with IPs. We're seeing an average wait of up to 3 minutes when the WARM_IP_TARGET override is enabled, compared to an average of 30-60 seconds with the default WARM_ENI_TARGET=1 on m5.xlarge.

During that 3-minute wait, pods are stuck pending for an extended period since there is no IP to allocate to them. I do sometimes see pods waiting with the default WARM_ENI_TARGET configuration as well, but since the initialization time is so fast there, the delay is negligible.
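
The per-node delays in the examples below can be pulled from the ipamd log by comparing the start line with the first pool-increase line, roughly like this (log path assumed to be the default):

# Measure init -> first usable IPs per node
grep -E 'Starting L-IPAMD|Successfully increased IP pool' /var/log/aws-routed-eni/ipamd.log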

Examples:

WARM_IP_TARGET=5 MINIMUM_IP_TARGET=25

Node 1:

{"level":"info","ts":"2021-11-11T05:40:03.932Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:03.934Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:04.337Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

Node 2:

{"level":"info","ts":"2021-11-11T05:40:02.882Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:02.883Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:06.767Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

Node 3:

{"level":"info","ts":"2021-11-11T05:40:03.974Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T05:40:03.975Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"debug","ts":"2021-11-11T05:43:05.236Z","caller":"ipamd/ipamd.go:1039","msg":"Current warm IP stats: target: 5, total: 14, assigned: 0, available: 14, short: 11, over 0"}

WARM_ENI_TARGET=1 (Default)

Node 1:

{"level":"info","ts":"2021-11-11T06:12:10.002Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:10.003Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:21.139Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:21.139Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:22.510Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-4b0995b0-dbd8-f4f9-1e56-14919ad2318a, Sandbox 82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:22.511Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7\" ContainerID:\"82ea78f03a231293bf67c0d28b7219378089f644c4d8e69ac052c21dc2f942a7\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-4b0995b0-dbd8-f4f9-1e56-14919ad2318a\" "}
{"level":"debug","ts":"2021-11-11T06:12:22.511Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

Node 2:

{"level":"info","ts":"2021-11-11T06:12:09.881Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:09.882Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:20.717Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:20.717Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:22.210Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-57a72a54-6f0e-7d8a-c423-bc826298d8fa, Sandbox 330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:22.210Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9\" ContainerID:\"330d515564bd9e0382c8b786449771e67b27bb58fccb7030f5ed5b33ca55edb9\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-57a72a54-6f0e-7d8a-c423-bc826298d8fa\" "}
{"level":"debug","ts":"2021-11-11T06:12:22.210Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

Node 3:

{"level":"info","ts":"2021-11-11T06:12:10.108Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.0  ..."}
{"level":"info","ts":"2021-11-11T06:12:10.109Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"debug","ts":"2021-11-11T06:12:24.631Z","caller":"ipamd/ipamd.go:706","msg":"Successfully increased IP pool, total: 14, used: 0"}
{"level":"debug","ts":"2021-11-11T06:12:24.631Z","caller":"ipamd/ipamd.go:723","msg":"IP pool stats: total = 14, used = 0, c.maxIPsPerENI = 14"}
{"level":"info","ts":"2021-11-11T06:12:34.864Z","caller":"rpc/rpc.pb.go:501","msg":"Received AddNetwork for NS /var/run/netns/cni-285be1d5-8ad2-7a0f-29f3-0d1c91a6efb0, Sandbox 1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd, ifname eth0"}
{"level":"debug","ts":"2021-11-11T06:12:34.864Z","caller":"rpc/rpc.pb.go:501","msg":"AddNetworkRequest: ClientVersion:\"v1.9.0\" K8S_POD_NAME:\"xxxxxxx\" K8S_POD_NAMESPACE:\"xxxxxxx\" K8S_POD_INFRA_CONTAINER_ID:\"1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd\" ContainerID:\"1ece564a1c9ea4e7c311950efda30949538847f30a98eabf90fa3651c3b6adfd\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-285be1d5-8ad2-7a0f-29f3-0d1c91a6efb0\" "}
{"level":"debug","ts":"2021-11-11T06:12:34.864Z","caller":"ipamd/rpc_handler.go:141","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 0"}

dtnyn avatar Nov 11 '21 06:11 dtnyn

@dtnyn Have you given the new IP prefix's a try/do they have similar delay? I expect I might be experiencing similar delays as we have the same type of config in our clusters.

sidewinder12s avatar Jan 26 '22 00:01 sidewinder12s

I did a test on my local cluster to compare the time difference between init and when the datastore has IPs, and didn't find any difference between the two configs -

WARM_ENI_TARGET = 1

Init start ->

{"level":"info","ts":"2022-01-26T06:30:00.329Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.3  ..."}

Reconciler increasing the pool and adding a secondary ENI ->

{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:628","msg":"IP pool is too low: available (0) < ENI target (1) * addrsPerENI (3)"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:629","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:899","msg":"Skip the primary ENI for need IP check"}
{"level":"info","ts":"2022-01-26T06:30:05.052Z","caller":"eniconfig/eniconfig.go:68","msg":"Get Node Info for: ip-192-168-30-100.us-west-2.compute.internal"}
{"level":"debug","ts":"2022-01-26T06:30:05.052Z","caller":"eniconfig/eniconfig.go:129","msg":"Using ENI_CONFIG_LABEL_DEF failure-domain.beta.kubernetes.io/zone"}
{"level":"info","ts":"2022-01-26T06:30:05.052Z","caller":"ipamd/ipamd.go:806","msg":"Found ENI Config Name: us-west-2a"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"ipamd/ipamd.go:780","msg":"ipamd: using custom network config: [], subnet-0eeb2ea845de1c8d3"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"Using a custom network config for the new ENI"}
{"level":"warn","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"No custom networking security group found, will use the node's primary ENI's SG: [%!s(*string=0xc0006cd760) %!s(*string=0xc0006cd770)]"}
{"level":"info","ts":"2022-01-26T06:30:05.153Z","caller":"awsutils/awsutils.go:689","msg":"Creating ENI with security groups: [sg-04123967cfc7fea84 sg-0c72b30626f1cbaf3] in subnet: subnet-0eeb2ea845de1c8d3"}
{"level":"info","ts":"2022-01-26T06:30:05.608Z","caller":"awsutils/awsutils.go:689","msg":"Created a new ENI: eni-07c415ca2e05f9b9d"}
{"level":"debug","ts":"2022-01-26T06:30:05.709Z","caller":"awsutils/awsutils.go:732","msg":"Discovered device number is used: 0"}
{"level":"debug","ts":"2022-01-26T06:30:05.709Z","caller":"awsutils/awsutils.go:732","msg":"Found a free device number: 1"}
{"level":"info","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:821","msg":"Successfully created and attached a new ENI eni-07c415ca2e05f9b9d to instance"}
{"level":"info","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:830","msg":"Trying to allocate 3 IP addresses on ENI eni-07c415ca2e05f9b9d"}
{"level":"debug","ts":"2022-01-26T06:30:07.040Z","caller":"ipamd/ipamd.go:830","msg":"PD enabled - false"}
{"level":"info","ts":"2022-01-26T06:30:07.535Z","caller":"ipamd/ipamd.go:830","msg":"Allocated 3 private IP addresses"}

WARM_IP_TARGET = 1 and MINIMUM_IP_TARGET = 3

Init start ->

{"level":"info","ts":"2022-01-26T06:57:17.253Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.3  ..."}

Reconciler increasing the pool and adding a secondary ENI ->

{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:629","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:737","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:862","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:886","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.208Z","caller":"ipamd/ipamd.go:899","msg":"Skip the primary ENI for need IP check"}
{"level":"info","ts":"2022-01-26T06:57:21.208Z","caller":"eniconfig/eniconfig.go:68","msg":"Get Node Info for: ip-192-168-33-140.us-west-2.compute.internal"}
{"level":"debug","ts":"2022-01-26T06:57:21.209Z","caller":"eniconfig/eniconfig.go:129","msg":"Using ENI_CONFIG_LABEL_DEF failure-domain.beta.kubernetes.io/zone"}
{"level":"info","ts":"2022-01-26T06:57:21.209Z","caller":"ipamd/ipamd.go:806","msg":"Found ENI Config Name: us-west-2b"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"ipamd/ipamd.go:780","msg":"ipamd: using custom network config: [], subnet-0747217e1d86e6510"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"Using a custom network config for the new ENI"}
{"level":"warn","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"No custom networking security group found, will use the node's primary ENI's SG: [%!s(*string=0xc00024c5e0) %!s(*string=0xc00024c5f0)]"}
{"level":"info","ts":"2022-01-26T06:57:21.309Z","caller":"awsutils/awsutils.go:689","msg":"Creating ENI with security groups: [sg-04123967cfc7fea84 sg-0c72b30626f1cbaf3] in subnet: subnet-0747217e1d86e6510"}
{"level":"info","ts":"2022-01-26T06:57:21.723Z","caller":"awsutils/awsutils.go:689","msg":"Created a new ENI: eni-052e27ef0393de5ca"}
{"level":"debug","ts":"2022-01-26T06:57:21.824Z","caller":"awsutils/awsutils.go:732","msg":"Discovered device number is used: 0"}
{"level":"debug","ts":"2022-01-26T06:57:21.824Z","caller":"awsutils/awsutils.go:732","msg":"Found a free device number: 1"}
{"level":"info","ts":"2022-01-26T06:57:23.135Z","caller":"ipamd/ipamd.go:821","msg":"Successfully created and attached a new ENI eni-052e27ef0393de5ca to instance"}
{"level":"debug","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:1899","msg":"Current warm IP stats: target: 1, total: 0, assigned: 0, available: 0, short: 3, over 0"}
{"level":"info","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:830","msg":"Trying to allocate 3 IP addresses on ENI eni-052e27ef0393de5ca"}
{"level":"debug","ts":"2022-01-26T06:57:23.136Z","caller":"ipamd/ipamd.go:830","msg":"PD enabled - false"}
{"level":"info","ts":"2022-01-26T06:57:23.554Z","caller":"ipamd/ipamd.go:830","msg":"Allocated 3 private IP addresses"}

In both cases the reconciler tried to increase the pool within 4-5 seconds. There is no difference in the code path between the two cases: once the reconciler starts, the pool is increased based on the WARM targets.

But I feel there is one optimization that can be done here: since you are using custom networking and the primary ENI cannot have IPs/prefixes attached, we should go ahead and attach a secondary ENI based on the WARM targets during init. That should help with pod launch time; otherwise the node is marked as ready while we still have to wait for the reconciler to increase the pool.

jayanthvn avatar Jan 26 '22 07:01 jayanthvn

@sidewinder12s We tried prefix mode but did not do extended testing due to the minimum allocation size of /28, which doesn't fit our use-case.
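
(For anyone else evaluating prefix mode: it is toggled with the env var below on the same daemonset, and it hands out /28 prefixes as noted above.)

# Enable prefix delegation on the CNI daemonset
kubectl -n kube-system set env daemonset aws-node ENABLE_PREFIX_DELEGATION=true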

@jayanthvn I believe your proposed optimization, attaching a new ENI with pre-allocated IPs, will be the best fix for our current problem. We can observe that the ENI attachment and the start of the reconciler are very fast; only the initial reconciler run adding new IPs to the ENI is very slow. Any subsequent reconciler runs on the same ENI are fast.

dtnyn avatar Jan 27 '22 03:01 dtnyn

@dtnyn - Yes that's right. I have the PR and testing it out locally.

jayanthvn avatar Jan 27 '22 03:01 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 29 '22 00:04 github-actions[bot]

@jayanthvn we're experiencing this issue as well. How has your local testing been going?

CameronHall avatar May 05 '22 12:05 CameronHall

I am waiting for the code review.

jayanthvn avatar May 05 '22 23:05 jayanthvn

+1

leofernandezg avatar Jul 14 '22 20:07 leofernandezg

Any updates on PR progress?

sidewinder12s avatar Sep 12 '22 20:09 sidewinder12s

We had discussed a few changes for the PR but couldn't progress because of other commitments. I will take it up next week for changes/review and track it for the 1.11.5 or 1.12.0 release. I will provide the release timelines once the PR is merged.

jayanthvn avatar Sep 12 '22 20:09 jayanthvn

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Nov 12 '22 00:11 github-actions[bot]

Not stale. We still regularly see this issue.

sidewinder12s avatar Nov 14 '22 15:11 sidewinder12s