cluster-api-provider-aws

Unable to create cluster using amazon-2 ami

Open MaxFedotov opened this issue 2 years ago • 5 comments

/kind bug

What steps did you take and what happened: Create a cluster using the capa-ami-amazon-2-v1.25.12 image. The control-plane node fails to start, and the following errors appear in the control-plane kubelet logs:

Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104552    4449 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104604    4449 kuberuntime_sandbox.go:71] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104631    4449 kuberuntime_manager.go:772] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104708    4449 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"etcd-ip-10-189-0-251.eu-central-1.compute.internal_kube-system(a812507b09a2bba6c5690db77f322d9f)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"etcd-ip-10-189-0-251.eu-central-1.compute.internal_kube-system(a812507b09a2bba6c5690db77f322d9f)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown\"" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal" podUID=a812507b09a2bba6c5690db77f322d9f

Running the runc binary directly returns the following error:

[root@ip-10-189-0-251 ~]# /usr/local/sbin/runc --help
/usr/local/sbin/runc: symbol lookup error: /usr/local/sbin/runc: undefined symbol: seccomp_notify_respond
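To narrow down this kind of failure, it can help to pull the missing symbol name out of the loader message and check whether the installed library exports it. The following is a minimal sketch, not part of CAPA or image-builder: the helper names `undefined_symbol` and `lib_exports_symbol` are hypothetical, and the library path in the usage note is an assumption about where libseccomp lives on the node.

```shell
# Hypothetical helper: read a loader error on stdin and print the name
# that follows "undefined symbol:" (prints nothing if there is no match).
undefined_symbol() {
  sed -n 's/.*undefined symbol: \([A-Za-z0-9_]*\).*/\1/p'
}

# Hypothetical helper: succeed if shared library $1 defines dynamic symbol $2.
# Uses nm from binutils; the library path is supplied by the caller.
lib_exports_symbol() {
  nm -D --defined-only "$1" 2>/dev/null | grep -qw "$2"
}
```

For example, `/usr/local/sbin/runc --help 2>&1 | undefined_symbol` should print the symbol from the error above, and `lib_exports_symbol /usr/lib64/libseccomp.so.2 seccomp_notify_respond` (path assumed) would then report whether the installed libseccomp provides it.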

This happens because CAPA uses the cri-containerd-*.tar.gz archive to install containerd and runc. According to the containerd release notes (https://github.com/containerd/containerd/blob/40f26543bdc27cbe8b058ac082e91c5832bb1c41/releases/v1.6.0.toml#L64-L76), the runc binary included in the containerd distribution is dynamically linked against libseccomp.

CAPA is using the following version of containerd:

[root@ip-10-189-0-251 ~]# /usr/local/bin/containerd --version
containerd github.com/containerd/containerd v1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8

which, according to the release notes, includes runc v1.1.7.

runc v1.1.7 is linked against libseccomp-2.5.4, but the installed version is:

[root@ip-10-189-0-251 ~]# yum list installed | grep libsec
libseccomp.x86_64                     2.4.1-1.amzn2                  installed

which is the latest libseccomp version available in the epel7 repo.
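The version mismatch can be checked mechanically. The sketch below is only an illustration: `version_lt` and `installed_libseccomp_version` are hypothetical helper names, the comparison relies on GNU `sort -V`, and the 2.5.0 threshold in the usage note is an assumption based on the libseccomp release that introduced the `seccomp_notify_*` functions.

```shell
# Hypothetical helper: succeed when version $1 is strictly lower than $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Hypothetical helper: print the installed libseccomp version on an
# RPM-based system such as Amazon Linux 2.
installed_libseccomp_version() {
  rpm -q --queryformat '%{VERSION}' libseccomp
}
```

On the affected node, `version_lt "$(installed_libseccomp_version)" 2.5.0 && echo "libseccomp too old for the bundled runc"` would be expected to print the warning, since 2.4.1 sorts below 2.5.0.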

What did you expect to happen: Users should be able to create a cluster using Amazon Linux 2 images.

Anything else you would like to add: I was able to fix this issue in my image-builder fork by adding Ansible steps that download a statically linked runc from https://github.com/opencontainers/runc/releases and replace the runc installed by the cri-containerd-*.tar.gz archive. I can open a pull request in the image-builder repo with the fix if you are OK with this approach.
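The actual fix in the fork is written as Ansible steps; the shell sketch below only illustrates the same idea and is not that implementation. The function names `runc_release_url` and `replace_runc` are hypothetical, the asset naming `runc.<arch>` is assumed from the layout of the runc releases page, and checksum verification is omitted for brevity.

```shell
# Hypothetical helper: build the download URL for a runc release binary.
# $1 = release tag (e.g. v1.1.7), $2 = architecture suffix (e.g. amd64).
runc_release_url() {
  printf 'https://github.com/opencontainers/runc/releases/download/%s/runc.%s\n' "$1" "$2"
}

# Hypothetical helper: download the release binary and install it over the
# runc at path $3. Needs root; does NOT verify checksums (assumption: add
# verification before using anything like this in an image build).
replace_runc() {
  tmp="$(mktemp)" || return 1
  if ! curl -fsSL -o "$tmp" "$(runc_release_url "$1" "$2")"; then
    rm -f "$tmp"
    return 1
  fi
  install -m 0755 "$tmp" "$3"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

With these definitions, `replace_runc v1.1.7 amd64 /usr/local/sbin/runc` would overwrite the dynamically linked binary shown earlier, after which `/usr/local/sbin/runc --version` should run without the symbol lookup error.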

Environment:

  • Cluster-api-provider-aws version: registry.k8s.io/cluster-api-aws/cluster-api-aws-controller:v2.2.1
  • Kubernetes version: (use kubectl version): v1.25.12
  • OS (e.g. from /etc/os-release): Amazon Linux 2

MaxFedotov, Aug 01 '23 14:08