
Conflict between EKS Best Practices and latest AL2023 AMI.

Open dawgware opened this issue 11 months ago • 9 comments

While upgrading one of our EKS clusters to 1.31, we ran into an issue where new 1.31 nodes would not join the cluster.
In the logs we saw the following from containerd:

Jan 23 03:55:48 ip-10-0-130-143.ec2.internal systemd[1]: Started containerd.service - containerd container runtime.
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.855808769Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:efs-csi-node-5fdcs,Uid:dc3211ea-b057-4c13-8d05-598cabb16988,Namespace:kube-system,Attempt:0,}"
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.859534903Z" level=info msg="trying next host" error="failed to do request: Head \"https://localhost/v2/kubernetes/pause/manifests/latest\": dial tcp 127.0.0.1:443: connect: connection refused" host=localhost
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.861665120Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:efs-csi-node-5fdcs,Uid:dc3211ea-b057-4c13-8d05-598cabb16988,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"localhost/kubernetes/pause\": failed to pull image \"l>
Jan 23 03:55:50 ip-10-0-130-143.ec2.internal containerd[4125]: time="2025-01-23T03:55:50.861694974Z" level=info msg="stop pulling image localhost/kubernetes/pause:latest: active requests=0, bytes read=0"

Checking the logs of one of the running 1.30 nodes we found the following:

Jan 22 04:25:28 ip-10-0-128-200.ec2.internal systemd[1]: Started containerd.service - containerd container runtime.
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.245168728Z" level=info msg="PullImage \"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\""
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.759157615Z" level=info msg="ImageCreate event name:\"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"} labels:{key:\"io.cri-containerd.pinned\" value:\"pinned\"}"
Jan 22 04:25:30 ip-10-0-128-200.ec2.internal containerd[4224]: time="2025-01-22T04:25:30.761025880Z" level=info msg="stop pulling image 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5: active requests=0, bytes read=298689"

We were not sure why the 1.31 node was trying to pull the pause image from localhost rather than from ECR as on the 1.30 node, so we checked each node's /etc/containerd/config.toml.

1.31 config:

cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "localhost/kubernetes/pause"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/sbin/runc"
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

1.30 config:

cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

So the al2023-1.31 AMI now configures containerd to pull the pause image locally. Searching through the awslabs/amazon-eks-ami GitHub repository, I found PR #2000.

The pause container image is now cached during the AMI build. Now we knew what the problem was.

We created our clusters using the Terraform EKS Blueprints. We also tried to follow some of the EKS Best Practices, in particular Use multiple EBS volumes for containers, which advises using a second volume mounted at /var/lib/containerd. The following script is run as a preBootstrapCommand:

 "systemctl stop containerd"
 "mkfs -t ext4 /dev/nvme1n1"
 "rm -rf /var/lib/containerd/*"
 "mount /dev/nvme1n1 /var/lib/containerd/"
 "systemctl start containerd"

One of the steps removes everything under /var/lib/containerd prior to mounting the volume. With the pause container now cached in the AMI, it's likely that cache was being deleted by this bootstrap command.
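
For reference, a quick way to check whether the cached sandbox image is still present in containerd's image store on a node (a diagnostic sketch, using the sandbox_image reference from the 1.31 config above):

sudo ctr --namespace k8s.io images ls | grep localhost/kubernetes/pause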

To test this, we removed the second volume and the preBootstrapCommand from our terraform node group template. We ran the upgrade again and all the new nodes started and joined the cluster as expected.

At this point we are not sure whether we need the second volume at all, since our applications rarely, if ever, write to disk, so disk quotas should not be an issue. We're doing some testing now to check the disk I/O on our nodes.

However, if we did need to use a second volume for containerd again, given that the pause container is now cached and pulled locally, what would be the workaround in this scenario?

An update to the EKS Best Practices document may be in order as well.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): r6i.4xlarge
  • Cluster Kubernetes version: 1.31
  • Node Kubernetes version: 1.31
  • AMI Version: amazon-eks-node-al2023-x86_64-standard-1.31-v20250116

dawgware avatar Jan 23 '25 18:01 dawgware

You’ll just need to copy the existing containerd working directory onto your dedicated volume prior to mounting it at /var/lib/containerd.

We do something similar for instance store volumes: https://github.com/awslabs/amazon-eks-ami/blob/64c70247ca8574e9c0efd5b672fa33fff41e76e7/templates/shared/runtime/bin/setup-local-disks#L132
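
Something along these lines as a preBootstrapCommand keeps the cached sandbox image intact (a minimal sketch, assuming the same ext4 volume at /dev/nvme1n1 as in the original script; /mnt/containerd-staging is just an illustrative temporary mount point, and this is not the exact logic of setup-local-disks):

systemctl stop containerd
mkfs -t ext4 /dev/nvme1n1
mkdir -p /mnt/containerd-staging
mount /dev/nvme1n1 /mnt/containerd-staging
cp -a /var/lib/containerd/. /mnt/containerd-staging/  # copy existing state, including the cached pause image
umount /mnt/containerd-staging
mount /dev/nvme1n1 /var/lib/containerd
systemctl start containerd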

There’s a NodeConfig option to enable that, so you don’t need additional user data scripts: https://awslabs.github.io/amazon-eks-ami/nodeadm/doc/api/#localstorageoptions

I’ll see what we can do to improve the docs. I wouldn’t necessarily call a dedicated volume for pods a “best practice”, but it is a solution to a specific category of issues. We do not do this by default for managed nodegroups, for example.

cartermckinnon avatar Jan 23 '25 20:01 cartermckinnon

@cartermckinnon we do the same thing as OP: we have another EBS volume attached to our Nitro EC2 instances and mount /var/lib/containerd on that XFS filesystem to separate OS disk I/O from the container filesystem, and we have come across the same error as OP.

It's unclear from the NodeConfig docs you've supplied how localStorageOptions is supposed to help us here. I've gone ahead and tried to copy the existing contents of /var/lib/containerd before removing the directory and mounting the separate EBS volume, using the script below, without success:

mkfs -t xfs /dev/nvme1n1
cp -a /var/lib/containerd /tmp/ # copy containerd directory to /tmp/
systemctl stop containerd
rm -rf /var/lib/containerd
mkdir -p /var/lib/containerd
mount /dev/nvme1n1 /var/lib/containerd
cp -a /tmp/containerd /var/lib/containerd
systemctl start containerd

I still see the failure to pull the image from localhost. As a workaround, I've gone ahead and pointed the sandbox_image key in /etc/containerd/config.toml to the registry.k8s.io image:

sed -i 's/localhost\/kubernetes\/pause/registry.k8s.io\/pause/g' /etc/containerd/config.toml
systemctl start containerd

edify42 avatar Mar 17 '25 00:03 edify42

@edify42 please let us not increase the cost to the community by pointing to community images directly. Please cache the image you need in your own ECR repo and use it, or at least use a pull-through cache: https://github.com/kubernetes/registry.k8s.io/blob/main/docs/mirroring/README.md#mirroring-with-ecr
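
For reference, a pull-through cache rule only needs to be created once per account and region, after which sandbox_image can point at the cached path instead of the upstream registry (a hedged sketch; the registry-k8s-io repository prefix is just an illustrative name):

aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix registry-k8s-io \
  --upstream-registry-url registry.k8s.io \
  --region us-east-1
# then reference <account-id>.dkr.ecr.us-east-1.amazonaws.com/registry-k8s-io/pause:<tag> in sandbox_image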

dims avatar Mar 17 '25 01:03 dims

@cartermckinnon Thank you for your reply and workaround proposal. After performing some I/O testing we determined that we did not need the second volume for containerd, and we have since removed the preBootstrapCommand and the second volume from our node groups.

dawgware avatar Mar 21 '25 15:03 dawgware

We were doing the exact same thing, and faced the exact same issue, thanks for the analysis ❤

nmamn avatar Mar 28 '25 12:03 nmamn

For reference, we ran into the same issue but use a hack that makes it kind of work:

For our CI pipelines we run instances with local nvme disks.

We use RAID0 for our local nvme volume but don't pair it with additional disks: It's possible to run a RAID0 set with just a single disk.

So in our NodeConfig, we have:

  instance:
    localStorage:
      strategy: RAID0

This copies the /var/lib/kubelet and /var/lib/containerd to the new storage device as expected.

There seems to be a small performance penalty using RAID0 with a single disk, but this is acceptable for us: Using EBS storage for our CI pipelines massively increases the pipeline duration, so having just a single nvme disk with a small performance penalty because of the RAID0 setup is better than having to use EBS storage.

This only works with a single volume: If you add additional volumes, those will be added to the RAID0 set as well, so having multiple volumes mounted and used as separate storage devices is not possible anymore. (It does improve the speed though)

fliphess avatar May 23 '25 11:05 fliphess

We ran into the same problem with the pause container when we switched to AL2023. After some digging in the repository, we found https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/bin/cache-pause-container, which, if memory serves right, is executed during the AMI build process.

This means the pause container image can actually be found in /etc/eks/pause.tar and just needs to be re-imported into containerd after the new volume is mounted at /var/lib/containerd.

We do this using cloud-init and ctr by adding the following to our pre-nodeadm bootstrap script:

# re-import pause image to containerd after mounting the new volume
echo "-- re-importing sandbox image from local file.."
systemctl start containerd

while [ "$(systemctl is-active containerd)" != "active" ]; do
  echo "-- waiting for containerd socket.."
  sleep 2s
done

echo "-- importing image.."
ctr --address /run/containerd/containerd.sock --namespace k8s.io image import --base-name localhost/kubernetes/pause:latest /etc/eks/pause.tar
echo "-- done. stopping containerd"
systemctl stop containerd
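
To confirm the import worked (while containerd is still running, i.e. before the final systemctl stop), something like the following should list the re-imported reference (a quick check, not part of the original script):

ctr --address /run/containerd/containerd.sock --namespace k8s.io images ls | grep localhost/kubernetes/pause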

pvlkov avatar Jul 29 '25 13:07 pvlkov

We ran into this issue by accident when containers in prod suddenly stopped being scheduled because of this error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost/kubernetes/pause": failed to pull image "localhost/kubernetes/pause": failed to pull and unpack image "localhost/kubernetes/pause:latest": failed to resolve reference "localhost/kubernetes/pause:latest": failed to do request: Head "https://localhost/v2/kubernetes/pause/manifests/latest": dial tcp 127.0.0.1:443: connect: connection refused

We hadn't rotated the node for about a month and a half, and it looks like the imported image got corrupted somehow. The only way we were able to fix it was by updating the config file to point to registry.k8s.io/pause as outlined in https://github.com/awslabs/amazon-eks-ami/issues/2122#issuecomment-2727764575.

We later saw the comment above, which also shows how to import it from the local path, but it was too late to try that out.

Is there a reason why the sandbox image isn't just using the image from https://gallery.ecr.aws/eks-distro/kubernetes/pause?

steve-todorov avatar Oct 02 '25 22:10 steve-todorov

In my case, the simplest solution was to locate the pause image locally and create the localhost/kubernetes/pause tag from that image. Presumably the localhost/kubernetes/pause image was removed during a disk-pressure cleanup, which deletes unused images. In AWS EKS 1.32, using localhost/kubernetes/pause appears to be intended to avoid repeated image downloads. I freed up disk space by removing unused images and then re-tagged the cached image:

sudo ctr -n k8s.io images prune --all
sudo ctr -n k8s.io images tag \
  602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.10 \
  localhost/kubernetes/pause:latest

(Note: the 602401143452 account ID varies by region: https://docs.aws.amazon.com/ko_kr/eks/latest/userguide/add-ons-images.html)

ryuseongryong avatar Oct 31 '25 00:10 ryuseongryong