amazon-vpc-cni-k8s

VPC CNI stuck in crash loop without insights

Open duxing opened this issue 8 months ago • 11 comments

What happened:

I'm testing autoscaling for my EKS cluster (1.29), and Karpenter is frequently scaling nodes up and down during the test. At a certain point, all newly launched nodes got stuck in NotReady because the VPC CNI pod was stuck in a crash loop.

The symptom looks very similar to hitting the EC2/ENI API rate limit; however, I can't find useful logs/metrics on the client side (the VPC CNI pod) to confirm or diagnose this, despite AWS_VPC_K8S_CNI_LOGLEVEL being set to DEBUG (AWS_VPC_K8S_PLUGIN_LOG_LEVEL is also DEBUG, if it matters).

The version I'm using is v1.18.1-eksbuild.3 (the EKS-optimized add-on); the logs are attached below.
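Since the pod logs don't surface throttling, one way to test the rate-limit theory from outside the pod is to look for throttled ENI calls in CloudTrail. A rough sketch (assuming CloudTrail management events are enabled in the region and the AWS CLI plus jq are available; CreateNetworkInterface is just one of the EC2 calls the CNI makes):

# list recent CreateNetworkInterface calls and keep only the ones that errored
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateNetworkInterface \
  --max-results 50 \
| jq '.Events[].CloudTrailEvent | fromjson
      | select(.errorCode != null)
      | {eventTime, errorCode, errorMessage}'
# throttling typically shows up as errorCode "Client.RequestLimitExceeded"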

Attach logs

Unable to run sudo bash /opt/cni/bin/aws-cni-support.sh: the image 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.1-eksbuild.3 seems to be distroless.
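If the script was being attempted via kubectl exec, the distroless image is likely why it failed; the same script is installed on the node itself, so it should be runnable from the host instead (a sketch, assuming SSM access to the instance):

# from a workstation, open a shell on the affected node
aws ssm start-session --target <instance-id>
# then, on the node (the script prints where it writes the support bundle):
sudo bash /opt/cni/bin/aws-cni-support.sh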

logs from kubectl logs -f ...:

successfully launched aws-node pod (AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-06-04T23:27:31Z" level=info msg="Starting IPAM daemon... "
time="2024-06-04T23:27:31Z" level=info msg="Checking for IPAM connectivity... "
time="2024-06-04T23:27:33Z" level=info msg="Copying config file... "
time="2024-06-04T23:27:33Z" level=info msg="Successfully copied CNI plugin binary and config file."

stuck aws-node pod (also AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-06-05T03:40:00Z" level=info msg="Starting IPAM daemon... "
time="2024-06-05T03:40:00Z" level=info msg="Checking for IPAM connectivity... "
// stuck here indefinitely until the container is restarted
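For what it's worth, "Checking for IPAM connectivity... " is the entrypoint waiting for ipamd's local gRPC endpoint to become healthy, so when the pod hangs there, ipamd itself may have logged the reason. As far as I understand the defaults, ipamd writes its own log file on the host and serves a localhost-only introspection endpoint, both reachable from the node even when pod stdout shows nothing:

# ipamd log on the host (default AWS_VPC_K8S_CNI_LOG_FILE location)
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
# ipamd introspection endpoint, bound to localhost on the node
curl http://localhost:61679/v1/enis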

What you expected to happen:

With AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG, the logs should be more verbose, reporting what the process is doing. Whatever error/exception caused the initialization failure should be surfaced to the log stream under pretty much any log level (these entries should be at the ERROR level).

If there are exponential-backoff retries for 429 responses, they need to be surfaced in verbose mode (DEBUG log level).
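Related: ipamd already exports Prometheus metrics that include AWS API error counters, which might at least count throttles even when nothing is logged. Per the docs they are served on the node at port 61678 (the grep filter below is illustrative):

# AWS API latency/error metrics from ipamd
curl -s http://localhost:61678/metrics | grep -i awscni_aws_api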

How to reproduce it (as minimally and precisely as possible):

  • launch an EKS 1.29 cluster and install VPC CNI v1.18.1-eksbuild.3 from EKS add-ons
  • frequently launch and terminate nodes in batches (batches of 15 instances, every 10 minutes, terminating only after the nodes become Ready); a rough churn loop is sketched after this list
  • wait until new nodes are unable to become healthy because the aws-node pod is stuck in a crash loop
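The churn loop, illustrative only: in my setup Karpenter does the actual scaling in response to pending pods, and "inflate" is a placeholder deployment sized so that each batch needs ~15 nodes:

while true; do
  # scale up a placeholder workload so Karpenter launches a batch of nodes
  kubectl scale deployment inflate --replicas=150
  kubectl wait --for=condition=Ready nodes --all --timeout=15m
  # scale back down so Karpenter terminates the now-empty nodes
  kubectl scale deployment inflate --replicas=0
  sleep 600   # repeat roughly every 10 minutes
done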

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b
  • CNI Version: v1.18.1-eksbuild.3
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):

duxing · Jun 05 '24, 03:06