sysbox icon indicating copy to clipboard operation
sysbox copied to clipboard

EKS sysbox undeterministic installation failure via sysbox-deploy-k8s daemonset

Open rauerhans opened this issue 1 year ago • 5 comments

Context

Hello everyone! We would like to use Sysbox in our Kubernetes stack at Plural.sh as a highly secure means to run eg docker or other processes on k8s pods for various intends and purposes. Ultimately we intend to support sysbox on all three major clouds with AWS EKS, GCP GKE, and Azure AKS. I'm currently experimenting with Sysbox on EKS.

Intent

So far I added an EKS managed node group to each AZ (our EKS cluster stretches across three AZs) with a custom AMI from the official Canonical EKS optimized AMIs. This will launch one node in each AZ of us-east-1. I then tried to install sysbox on all with the provided Daemonset + SA + Roles + RuntimeClass see here. I added only additional tolerations, to match our configuration.

Observed Behaviour

Unfortunately, the installation only succeeded on 2 of the 3 nodes. I tried to replicate this a couple of times, and I was able to install it on all three once, however it then happened again where it only worked on two. On the most recent attempt the sysbox-deploy-k8s pod had to restart twice on the nodes where it worked, but was then able to install sysbox successfully. Currently I have again three nodes where only two of them work correctly. It seems to be quite flaky.

I'm having a hard time identifying the underlying issue, so some help would be greatly appreciated.

Logs

Unsuccessful Installation/Unhealthy Node

First error indication fromsysbox-deploy-k8s pod logs of the unsuccessful installation:

Detected Kubernetes version v1.23
Adding K8s taint "sysbox-runtime=not-running:NoSchedule" to node ...
node/ip-10-0-50-152.us-east-2.compute.internal modified
Adding K8s label "sysbox-runtime=installing" to node ...
node/ip-10-0-50-152.us-east-2.compute.internal not labeled
Installing Sysbox dependencies on host ...
Copying shiftfs sources to host ...
Kernel version 5.15 is >= 5.15 and < 5.17
Deploying Sysbox installer helper on the host ...
Running Sysbox installer helper on the host (may take several seconds) ...
Stopping the Sysbox installer helper on the host ...
Removing Sysbox installer helper from the host ...
Installing Sysbox on host ...
cp: cannot create regular file '/mnt/host/usr/bin/sysbox-fs': Text file busy

I can only conjecture that something during the installation process doesn't finish appropriately and never lets go of this binary.

Investigating on the node according to your troubleshooting guide reveals that the first time something goes wrong is during the CRI-O installation service:

Aug 03 13:11:16 ip-10-0-50-152 systemd[1]: Starting CRI-O installer service...
Aug 03 13:11:16 ip-10-0-50-152 sh[65513]: inactive
Aug 03 13:11:16 ip-10-0-50-152 sh[65511]: Installing CRI-O ...
Aug 03 13:11:16 ip-10-0-50-152 sh[65511]: /usr/local/bin /
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/etc/
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/etc/crictl.yaml
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/etc/crio-umount.conf
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/etc/10-crun.conf
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/etc/crio.conf
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/conmon
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/crun
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/crio-status
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/pinns
Aug 03 13:11:16 ip-10-0-50-152 sh[65523]: cri-o/bin/crictl
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/bin/crio
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/bin/runc
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/install
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/contrib/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/contrib/10-crio-bridge.conf
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/contrib/crio.service
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/contrib/policy.json
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/man/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/man/crio.8
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/man/crio-status.8
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/man/crio.conf.5
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/man/crio.conf.d.5
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/fish/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/fish/crio.fish
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/fish/crio-status.fish
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/bash/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/bash/crio-status
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/bash/crio
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/zsh/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/zsh/_crio-status
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/completions/zsh/_crio
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/Makefile
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/README.md
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/vlan
Aug 03 13:11:17 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/bandwidth
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/macvlan
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/portmap
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/tuning
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/ptp
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/host-local
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/bridge
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/vrf
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/loopback
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/static
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/host-device
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/firewall
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/sbr
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/dhcp
Aug 03 13:11:18 ip-10-0-50-152 sh[65523]: cri-o/cni-plugins/ipvlan
Aug 03 13:11:18 ip-10-0-50-152 sh[65511]: /usr/local/bin/cri-o /usr/local/bin /
Aug 03 13:11:19 ip-10-0-50-152 sh[65542]: Using existing runc (soft-linking /usr/local/bin/runc -> /usr/local/bin/runc)
Aug 03 13:11:19 ip-10-0-50-152 sh[65571]: ln: failed to create symbolic link '/usr/local/bin/runc': File exists
Aug 03 13:11:19 ip-10-0-50-152 sh[65511]: CRI-O installation done.
Aug 03 13:11:19 ip-10-0-50-152 systemd[1]: Finished CRI-O installer service.
Aug 03 13:11:19 ip-10-0-50-152 systemd[1]: crio-installer.service: Succeeded.
Aug 03 13:11:19 ip-10-0-50-152 systemd[1]: Stopped CRI-O installer service.

From the logs I can tell that it's retrying a couple of times, but no error message despite the ln: failed to create symbolic link

The sysbox-installer-helper logs also show that it's retrying something, just stops and starts over and over, but no discernible error for me here:

Aug 03 17:27:26 ip-10-0-50-152 systemd[1]: sysbox-installer-helper.service: Succeeded.
Aug 03 17:27:26 ip-10-0-50-152 systemd[1]: Stopped Sysbox installer helper service.
Aug 03 17:32:37 ip-10-0-50-152 systemd[1]: Starting Sysbox installer helper service...
Aug 03 17:32:37 ip-10-0-50-152 sh[216518]: Reading package lists...
Aug 03 17:32:37 ip-10-0-50-152 sh[216518]: Building dependency tree...
Aug 03 17:32:37 ip-10-0-50-152 sh[216518]: Reading state information...
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]: ca-certificates is already the newest version (20230311ubuntu0.20.04.1).
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]: The following packages were automatically installed and are no longer required:
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   cpp cpp-9 dctrl-tools fakeroot g++ g++-9 gcc gcc-9 gcc-9-base
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   libasan5 libatomic1 libc-dev-bin libc6-dev libcc1-0 libcrypt-dev
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   libdpkg-perl libfakeroot libfile-fcntllock-perl libgcc-9-dev libisl22
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   libitm1 liblsan0 libmpc3 libquadmath0 libstdc++-9-dev libtsan0 libubsan1
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]:   linux-libc-dev manpages-dev
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]: Use 'apt autoremove' to remove them.
Aug 03 17:32:38 ip-10-0-50-152 sh[216518]: 0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Aug 03 17:32:38 ip-10-0-50-152 sh[216522]: Hit:1 http://us-east-2.ec2.archive.ubuntu.com/ubuntu focal InRelease
Aug 03 17:32:38 ip-10-0-50-152 sh[216522]: Hit:2 http://us-east-2.ec2.archive.ubuntu.com/ubuntu focal-updates InRelease
Aug 03 17:32:38 ip-10-0-50-152 sh[216522]: Get:3 http://us-east-2.ec2.archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Aug 03 17:32:38 ip-10-0-50-152 sh[216522]: Get:4 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Aug 03 17:32:38 ip-10-0-50-152 sh[216522]: Hit:5 http://ppa.launchpad.net/cloud-images/eks-01.11.0/ubuntu focal InRelease
Aug 03 17:32:39 ip-10-0-50-152 sh[216522]: Fetched 222 kB in 1s (291 kB/s)
Aug 03 17:32:40 ip-10-0-50-152 sh[216522]: Reading package lists...
Aug 03 17:32:40 ip-10-0-50-152 sh[217000]: Reading package lists...
Aug 03 17:32:40 ip-10-0-50-152 sh[217000]: Building dependency tree...
Aug 03 17:32:40 ip-10-0-50-152 sh[217000]: Reading state information...
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: fuse is already the newest version (2.9.9-3).
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: iptables is already the newest version (1.8.4-3ubuntu2.1).
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: rsync is already the newest version (3.1.3-8ubuntu0.5).
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: The following packages were automatically installed and are no longer required:
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   cpp cpp-9 dctrl-tools fakeroot g++ g++-9 gcc gcc-9 gcc-9-base
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   libasan5 libatomic1 libc-dev-bin libc6-dev libcc1-0 libcrypt-dev
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   libdpkg-perl libfakeroot libfile-fcntllock-perl libgcc-9-dev libisl22
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   libitm1 liblsan0 libmpc3 libquadmath0 libstdc++-9-dev libtsan0 libubsan1
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]:   linux-libc-dev manpages-dev
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: Use 'apt autoremove' to remove them.
Aug 03 17:32:41 ip-10-0-50-152 sh[217000]: 0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Aug 03 17:32:41 ip-10-0-50-152 sh[216513]: Skipping shiftfs installation (it's already installed).
Aug 03 17:32:41 ip-10-0-50-152 sh[216513]: Probing kernel modules ...
Aug 03 17:32:41 ip-10-0-50-152 systemd[1]: Finished Sysbox installer helper service.
Aug 03 17:32:41 ip-10-0-50-152 systemd[1]: sysbox-installer-helper.service: Succeeded.
Aug 03 17:32:41 ip-10-0-50-152 systemd[1]: Stopped Sysbox installer helper service.
lines 959-1001/1001 (END)
root@ip-10-0-50-152:/# journalctl -eu kubelet-config-helper
-- Logs begin at Thu 2023-08-03 12:15:44 UTC, end at Thu 2023-08-03 17:24:00 UTC. --
-- No entries --

Looks like this process is never run, so it never reaches this step during the installation, which makes sense if it comes after the previous "succeeded", yet still failed step of the CRI-O installation

CRIO logs:

Aug 03 13:12:53 ip-10-0-50-152 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.328043240Z" level=info msg="Starting CRI-O, version: 1.23.2, git: e3b3bad0a394a955af1473a11380efb483f35a97(clean)"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.328241473Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.328260253Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.328282734Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.328295094Z" level=info msg="Node configuration value for cgroup v2 is false"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.334958405Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.342325387Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.388040368Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.388975631Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_SETPCAP, CAP_SETFCAP, CAP_NET_BIND_SERVICE, CAP_KILL, CAP_AUDIT_WRI>
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.392655561Z" level=info msg="Conmon does support the --sync option"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.392923165Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.392996506Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.393014486Z" level=info msg="No blockio config file specified, blockio not configured"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.393026186Z" level=info msg="RDT not available in the host system"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.406804317Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.409070888Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.411151866Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.411197977Z" level=info msg="Updated default CNI network name to aws-cni"
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.420697278Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory"
Aug 03 13:12:53 ip-10-0-50-152 systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Stopping Container Runtime Interface for OCI (CRI-O)...
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: crio.service: Succeeded.
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Stopped Container Runtime Interface for OCI (CRI-O).
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.767794074Z" level=info msg="Starting CRI-O, version: 1.23.2, git: e3b3bad0a394a955af1473a11380efb483f35a97(clean)"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768331311Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768426743Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768514374Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768597695Z" level=info msg="Node configuration value for cgroup v2 is false"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.774961543Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.782366685Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.786514082Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.787333183Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_SETPCAP, CAP_SETFCAP, CAP_NET_BIND_SERVICE, CAP_KILL, CAP_AUDIT_WRI>
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.790586868Z" level=info msg="Conmon does support the --sync option"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.790688250Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.790705990Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.790719180Z" level=info msg="No blockio config file specified, blockio not configured"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.790730230Z" level=info msg="RDT not available in the host system"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.802217418Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.804735762Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.806812431Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.806849882Z" level=info msg="Updated default CNI network name to aws-cni"
...
Aug 03 13:12:53 ip-10-0-50-152 crio[67057]: time="2023-08-03 13:12:53.420697278Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory"
Aug 03 13:12:53 ip-10-0-50-152 systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Stopping Container Runtime Interface for OCI (CRI-O)...
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: crio.service: Succeeded.
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Stopped Container Runtime Interface for OCI (CRI-O).
Aug 03 13:13:18 ip-10-0-50-152 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.767794074Z" level=info msg="Starting CRI-O, version: 1.23.2, git: e3b3bad0a394a955af1473a11380efb483f35a97(clean)"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768331311Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768426743Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768514374Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.768597695Z" level=info msg="Node configuration value for cgroup v2 is false"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.774961543Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.782366685Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.786514082Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
Aug 03 13:13:18 ip-10-0-50-152 crio[67270]: time="2023-08-03 13:13:18.787333183Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_SETPCAP, CAP_SETFCAP, CAP_NET_BIND_SERVICE, CAP_KILL, CAP_AUDIT_WRI>
...

also looks like retrying something. This error also appears on the healthy nodes, so I discarded it as no relevant to my issue:

Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.025043786Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory"

kubelet is operational, shows the expected errors from crashbackofflooping the sysbox-deploy-k8spods

Aug 03 17:58:18 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:58:18.662835    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
Aug 03 17:58:29 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:58:29.663316    9393 scope.go:110] "RemoveContainer" containerID="f25399e2f42d76eef1bee12e0b5ea786fefb26bb27be270bd6548e17a438b690"
Aug 03 17:58:37 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:58:37.519533    9393 scope.go:110] "RemoveContainer" containerID="f25399e2f42d76eef1bee12e0b5ea786fefb26bb27be270bd6548e17a438b690"
Aug 03 17:58:37 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:58:37.520340    9393 scope.go:110] "RemoveContainer" containerID="f5e3df89bb2750c8df88ae08cde52260a94831a3a0db0c92a4891049daf1ca39"
Aug 03 17:58:37 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:58:37.521124    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
Aug 03 17:58:51 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:58:51.662115    9393 scope.go:110] "RemoveContainer" containerID="f5e3df89bb2750c8df88ae08cde52260a94831a3a0db0c92a4891049daf1ca39"
Aug 03 17:58:51 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:58:51.663338    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
Aug 03 17:59:06 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:59:06.661318    9393 scope.go:110] "RemoveContainer" containerID="f5e3df89bb2750c8df88ae08cde52260a94831a3a0db0c92a4891049daf1ca39"
Aug 03 17:59:06 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:59:06.662526    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
Aug 03 17:59:18 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:59:18.662118    9393 scope.go:110] "RemoveContainer" containerID="f5e3df89bb2750c8df88ae08cde52260a94831a3a0db0c92a4891049daf1ca39"
Aug 03 17:59:18 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:59:18.662729    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
Aug 03 17:59:29 ip-10-0-50-152 kubelet-eks.daemon[9393]: I0803 17:59:29.662127    9393 scope.go:110] "RemoveContainer" containerID="f5e3df89bb2750c8df88ae08cde52260a94831a3a0db0c92a4891049daf1ca39"
Aug 03 17:59:29 ip-10-0-50-152 kubelet-eks.daemon[9393]: E0803 17:59:29.663361    9393 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sysbox-deploy-k8s\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=sysbox-dep>
lines 958-1001/1001 (END)

Health status:

root@ip-10-0-50-152:/# systemctl status sysbox
● sysbox.service - Sysbox container runtime
     Loaded: loaded (/lib/systemd/system/sysbox.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: https://github.com/nestybox/sysbox

Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Dependency failed for Sysbox container runtime.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox.service: Job sysbox.service/start failed with result 'dependency'.
root@ip-10-0-50-152:/# systemctl status sysbox-mgr
● sysbox-mgr.service - sysbox-mgr (part of the Sysbox container runtime)
     Loaded: loaded (/lib/systemd/system/sysbox-mgr.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Thu 2023-08-03 12:22:14 UTC; 5h 49min ago
   Main PID: 19431 (code=killed, signal=TERM)

Aug 03 12:21:29 ip-10-0-50-152 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Starting ..."
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Sysbox data root: /var/lib/sysbox"
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).
root@ip-10-0-50-152:/# systemctl status sysbox-fs
● sysbox-fs.service - sysbox-fs (part of the Sysbox container runtime)
     Loaded: loaded (/lib/systemd/system/sysbox-fs.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-03 12:22:14 UTC; 5h 49min ago
   Main PID: 19852 (sysbox-fs)
      Tasks: 8 (limit: 9411)
     Memory: 2.3M
     CGroup: /system.slice/sysbox-fs.service
             └─19852 /usr/bin/sysbox-fs

Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Starting sysbox-fs (part of the Sysbox container runtime)...
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="Initiating sysbox-fs ..."
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="Initializing with 'allow-immutable-remounts' knob disabled (default)"
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="Initializing with 'allow-immutable-unmounts' knob enabled (default)"
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="FUSE dir = /var/lib/sysboxfs"
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="IOvec memParser elected"
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="Listening on /run/sysbox/sysfs.sock"
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Started sysbox-fs (part of the Sysbox container runtime).
Aug 03 12:22:14 ip-10-0-50-152 sysbox-fs[19852]: time="2023-08-03 12:22:14" level=info msg="Ready ..."
root@ip-10-0-50-152:/# journalctl -eu sysbox
-- Logs begin at Thu 2023-08-03 12:15:44 UTC, end at Thu 2023-08-03 18:12:54 UTC. --
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Dependency failed for Sysbox container runtime.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox.service: Job sysbox.service/start failed with result 'dependency'.
-- Logs begin at Thu 2023-08-03 12:15:44 UTC, end at Thu 2023-08-03 18:13:50 UTC. --
Aug 03 12:21:29 ip-10-0-50-152 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Starting ..."
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Sysbox data root: /var/lib/sysbox"
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).
root@ip-10-0-50-152:/# journalctl -eu sysbox-mgr

despite the errors during installation CRIO actually seems to be healthy:

root@ip-10-0-50-152:/# systemctl status crio     
● crio.service - Container Runtime Interface for OCI (CRI-O)
     Loaded: loaded (/etc/systemd/system/crio.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-03 14:13:25 UTC; 4h 6min ago
       Docs: https://github.com/cri-o/cri-o
   Main PID: 101749 (crio)
      Tasks: 11
     Memory: 9.6M
     CGroup: /system.slice/crio.service
             └─101749 /usr/local/bin/crio

Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.299025099Z" level=info msg="CNI monitoring event \"/etc/cni/net.d/ysbox\": CREATE"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.319323078Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.321876401Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.324304571Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.324453984Z" level=info msg="Updated default CNI network name to aws-cni"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.324571956Z" level=info msg="CNI monitoring event \"/etc/cni/net.d/ysbox\": WRITE"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.342360943Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.347170333Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.350606380Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 14:14:17 ip-10-0-50-152 crio[101749]: time="2023-08-03 14:14:17.350914425Z" level=info msg="Updated default CNI network name to aws-cni"

Successful Installation/Healthy Node

sysbox-deploy-k8s pod logs on one of the successful installations:

Detected Kubernetes version v1.23
Adding K8s taint "sysbox-runtime=not-running:NoSchedule" to node ...
node/ip-10-0-24-88.us-east-2.compute.internal modified
Stopping the Kubelet config agent on the host ...
Removing Kubelet config agent from the host ...
Kubelet reconfig completed.
Adding K8s label "crio-runtime=running" to node ...
node/ip-10-0-24-88.us-east-2.compute.internal labeled
Adding K8s label "sysbox-runtime=running" to node ...
node/ip-10-0-24-88.us-east-2.compute.internal labeled
Removing K8s taint "sysbox-runtime=not-running:NoSchedule" from node ...
node/ip-10-0-24-88.us-east-2.compute.internal untainted
The k8s runtime on this node is now CRI-O.
Sysbox installation completed.
Done.

Helper services logs:

root@ip-10-0-24-88:/# journalctl -eu crio-installer
...
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/ptp
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/host-local
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/bridge
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/vrf
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/loopback
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/static
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/host-device
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/firewall
Aug 03 12:20:32 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/sbr
Aug 03 12:20:33 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/dhcp
Aug 03 12:20:33 ip-10-0-24-88 sh[13684]: cri-o/cni-plugins/ipvlan
Aug 03 12:20:33 ip-10-0-24-88 sh[13667]: /usr/local/bin/cri-o /usr/local/bin /
Aug 03 12:20:33 ip-10-0-24-88 sh[13733]: Using existing runc (soft-linking /usr/local/bin/runc -> /usr/sbin/runc)
Aug 03 12:20:33 ip-10-0-24-88 sh[13768]: Created symlink /etc/systemd/system/cri-o.service → /etc/systemd/system/crio.service.
Aug 03 12:20:33 ip-10-0-24-88 sh[13768]: Created symlink /etc/systemd/system/multi-user.target.wants/crio.service → /etc/systemd/system/c>
Aug 03 12:20:33 ip-10-0-24-88 sh[13667]: CRI-O installation done.
Aug 03 12:20:33 ip-10-0-24-88 systemd[1]: Finished CRI-O installer service.
Aug 03 12:20:33 ip-10-0-24-88 systemd[1]: crio-installer.service: Succeeded.
Aug 03 12:20:33 ip-10-0-24-88 systemd[1]: Stopped CRI-O installer service.

Here it successfully creates the symlinks.

next in line the sysbox-installer-helper, also inconspicuous, and here I see no retries:

root@ip-10-0-24-88:/# journalctl -eu sysbox-installer-helper
Aug 03 12:21:25 ip-10-0-24-88 sh[18977]: EFI variables are not supported on this system
Aug 03 12:21:25 ip-10-0-24-88 sh[18968]: /sys/firmware/efi/efivars not found, aborting.
Aug 03 12:21:25 ip-10-0-24-88 sh[18489]: cleaning build area...
Aug 03 12:21:25 ip-10-0-24-88 sh[18489]: DKMS: build completed.
Aug 03 12:21:25 ip-10-0-24-88 sh[18467]: dkms install -m shiftfs -v k516
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]: shiftfs.ko:
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]: Running module version sanity check.
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]:  - Original module
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]:    - No original module exists within this kernel
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]:  - Installation
Aug 03 12:21:26 ip-10-0-24-88 sh[19058]:    - Installing to /lib/modules/5.15.0-1040-aws/updates/dkms/
Aug 03 12:21:29 ip-10-0-24-88 sh[19058]: depmod...
Aug 03 12:21:30 ip-10-0-24-88 sh[19182]: .
Aug 03 12:21:30 ip-10-0-24-88 sh[19058]: DKMS: install completed.
Aug 03 12:21:30 ip-10-0-24-88 sh[19213]: Reading package lists...
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: Building dependency tree...
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: Reading state information...
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: The following packages were automatically installed and are no longer required:
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   cpp cpp-9 dctrl-tools fakeroot g++ g++-9 gcc gcc-9 gcc-9-base
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   libasan5 libatomic1 libc-dev-bin libc6-dev libcc1-0 libcrypt-dev
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   libdpkg-perl libfakeroot libfile-fcntllock-perl libgcc-9-dev libisl22
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   libitm1 liblsan0 libmpc3 libquadmath0 libstdc++-9-dev libtsan0 libubsan1
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   linux-libc-dev manpages-dev
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: Use 'apt autoremove' to remove them.
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: The following packages will be REMOVED:
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]:   build-essential* dkms* dpkg-dev* make*
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: 0 upgraded, 0 newly installed, 4 to remove and 1 not upgraded.
Aug 03 12:21:31 ip-10-0-24-88 sh[19213]: After this operation, 2835 kB disk space will be freed.
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: [614B blob data]
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Removing dkms (2.8.1-5ubuntu2) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Removing build-essential (12.8ubuntu1.1) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Removing dpkg-dev (1.19.7ubuntu3.2) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Removing make (4.2.1-1.2) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: [614B blob data]
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Purging configuration files for dpkg-dev (1.19.7ubuntu3.2) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: Purging configuration files for dkms (2.8.1-5ubuntu2) ...
Aug 03 12:21:32 ip-10-0-24-88 sh[19213]: dpkg: warning: while removing dkms, directory '/var/lib/dkms' not empty so not removed
Aug 03 12:21:34 ip-10-0-24-88 sh[14071]: Shiftfs installation done.
Aug 03 12:21:34 ip-10-0-24-88 sh[14071]: Probing kernel modules ...
Aug 03 12:21:34 ip-10-0-24-88 systemd[1]: Finished Sysbox installer helper service.
Aug 03 12:21:34 ip-10-0-24-88 systemd[1]: sysbox-installer-helper.service: Succeeded.
Aug 03 12:21:34 ip-10-0-24-88 systemd[1]: Stopped Sysbox installer helper service.
lines 370-412/412 (END)

CRIO logs:

root@ip-10-0-24-88:/# journalctl -u crio                    
-- Logs begin at Thu 2023-08-03 12:15:44 UTC, end at Thu 2023-08-03 17:50:38 UTC. --
Aug 03 12:21:36 ip-10-0-24-88 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.936810453Z" level=info msg="Starting CRI-O, version: 1.23.2, git: e3b3bad0a394a955af1473a11380efb483f35a97(clean)"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.936993436Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.937013306Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.937034886Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.937047346Z" level=info msg="Node configuration value for cgroup v2 is false"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.942972193Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.950521011Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.986976133Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.988070437Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_SETPCAP, CAP_SETFCAP, CAP_NET_BIND_SERVICE, CAP_KILL, CAP_AUDIT_WRIT>
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.993040421Z" level=info msg="Conmon does support the --sync option"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.993201283Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.993235584Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.993249164Z" level=info msg="No blockio config file specified, blockio not configured"
Aug 03 12:21:36 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:36.993261034Z" level=info msg="RDT not available in the host system"
Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.005105277Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.013961362Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.016186461Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.016315493Z" level=info msg="Updated default CNI network name to aws-cni"
Aug 03 12:21:37 ip-10-0-24-88 crio[19570]: time="2023-08-03 12:21:37.025043786Z" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory"
Aug 03 12:21:37 ip-10-0-24-88 systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
Aug 03 12:21:48 ip-10-0-24-88 systemd[1]: Stopping Container Runtime Interface for OCI (CRI-O)...
Aug 03 12:21:48 ip-10-0-24-88 systemd[1]: crio.service: Succeeded.
Aug 03 12:21:48 ip-10-0-24-88 systemd[1]: Stopped Container Runtime Interface for OCI (CRI-O).
Aug 03 12:21:48 ip-10-0-24-88 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.525113558Z" level=info msg="Starting CRI-O, version: 1.23.2, git: e3b3bad0a394a955af1473a11380efb483f35a97(clean)"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.525271960Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.525290040Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.525310480Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.525329251Z" level=info msg="Node configuration value for cgroup v2 is false"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.532001126Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.538590731Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.542605872Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.543272811Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_SETPCAP, CAP_SETFCAP, CAP_NET_BIND_SERVICE, CAP_KILL, CAP_AUDIT_WRIT>
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.545840873Z" level=info msg="Conmon does support the --sync option"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.545941545Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.545957375Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.545968945Z" level=info msg="No blockio config file specified, blockio not configured"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.545981835Z" level=info msg="RDT not available in the host system"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.557395092Z" level=info msg="Found CNI network aws-cni (type=aws-cni) at /etc/cni/net.d/10-aws.conflist"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.559801682Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.561848219Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 03 12:21:48 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:48.562018761Z" level=info msg="Updated default CNI network name to aws-cni"
Aug 03 12:21:48 ip-10-0-24-88 systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
Aug 03 12:21:49 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:49.067017018Z" level=info msg="Checking image status: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5" id=f899e148-54d6-4ae8-99ed-2f227eef6886 name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:49 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:49.067719877Z" level=info msg="Image 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause:3.5 not found" id=f899e148-54d6-4ae8-99ed-2f227eef6886 name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:50 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:50.362587165Z" level=info msg="Running pod sandbox: monitoring/monitoring-prometheus-node-exporter-tzfn8/POD" id=29fd9885-1678-44be-9c43-7bd0d48be3a8 name=/runtime.v1.RuntimeService/RunPodSandbox
Aug 03 12:21:50 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:50.362654056Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.639481980Z" level=info msg="Ran pod sandbox 54fd3e18e692e26d0036f40b94afab4ac164ca605c3142510baa9a37f16a4b94 with infra container: monitoring/monitoring-prometheus-node-exporter-tzfn8/POD" id=29fd9885-1678>
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.640936278Z" level=info msg="Checking image status: dkr.plural.sh/monitoring/prometheus/node-exporter:v1.3.1" id=33c01c9c-4e77-4a81-aaf8-f227f82918b2 name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.641259702Z" level=info msg="Image dkr.plural.sh/monitoring/prometheus/node-exporter:v1.3.1 not found" id=33c01c9c-4e77-4a81-aaf8-f227f82918b2 name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.642283936Z" level=info msg="Pulling image: dkr.plural.sh/monitoring/prometheus/node-exporter:v1.3.1" id=de0e14a6-6988-4e1e-9eaa-ee61c2ec38e6 name=/runtime.v1.ImageService/PullImage
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.644884249Z" level=info msg="Trying to access \"dkr.plural.sh/monitoring/prometheus/node-exporter:v1.3.1\""
Aug 03 12:21:51 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:51.904780300Z" level=info msg="Trying to access \"dkr.plural.sh/monitoring/prometheus/node-exporter:v1.3.1\""
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.165210926Z" level=info msg="Running pod sandbox: calico-system/calico-node-cgtx7/POD" id=f63e1cc4-7a00-4cc5-88b9-eb7df5c4d188 name=/runtime.v1.RuntimeService/RunPodSandbox
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.165596120Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.202683754Z" level=info msg="Ran pod sandbox 522ef087aede1c7bda6935bbae8905d45a15a8e229538e71d231865661d0d7a9 with infra container: calico-system/calico-node-cgtx7/POD" id=f63e1cc4-7a00-4cc5-88b9-eb7df5c4d1>
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.209460720Z" level=info msg="Checking image status: gcr.io/pluralsh/calico/pod2daemon-flexvol:v3.25.0" id=2cf8113a-4cd7-403c-975a-b633eb532dff name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.210091328Z" level=info msg="Image gcr.io/pluralsh/calico/pod2daemon-flexvol:v3.25.0 not found" id=2cf8113a-4cd7-403c-975a-b633eb532dff name=/runtime.v1.ImageService/ImageStatus
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.211401585Z" level=info msg="Pulling image: gcr.io/pluralsh/calico/pod2daemon-flexvol:v3.25.0" id=8076af45-1549-4b2b-9a9d-d42ba328463c name=/runtime.v1.ImageService/PullImage
Aug 03 12:21:52 ip-10-0-24-88 crio[24448]: time="2023-08-03 12:21:52.213061326Z" level=info msg="Trying to access \"gcr.io/pluralsh/calico/pod2daemon-flexvol:v3.25.0\""

looks pretty much the same as the unsuccessful one, except it assumes normal operation afterwards.

crio.conf

root@ip-10-0-24-88:/# cat /etc/crio/crio.conf

[crio]
  storage_driver = "overlay"
  storage_option = ["overlay.mountopt=metacopy=on"]

  [crio.image]

  [crio.network]
    plugin_dirs = ["/opt/cni/bin"]

  [crio.runtime]
    conmon_cgroup = "pod"
    cgroup_manager = "cgroupfs"
    default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "SETUID", "SETGID", "SETPCAP", "SETFCAP", "NET_BIND_SERVICE", "KILL", "AUDIT_WRITE", "NET_RAW", "SYS_CHROOT", "MKNOD"]
    pids_limit = 16384
    selinux = false

    [crio.runtime.runtimes]

      [crio.runtime.runtimes.sysbox-runc]
        allowed_annotations = ["io.kubernetes.cri-o.userns-mode"]
        runtime_path = "/usr/bin/sysbox-runc"
        runtime_type = "oci"

Configuration

Nodes are labeled appropriately as per your k8s install guide.

This is the configuration of one of the nodes where the installation was successful, the unsuccessful one looks no different except for the label sysbox-runtime:installing and the additional taint sysbox-runtime=not-running:NoSchedule that gets added during the sysbox installation process.

apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0592c9e85037c9cb1","efs.csi.aws.com":"i-0592c9e85037c9cb1"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: t3a.large
    beta.kubernetes.io/os: linux
    crio-runtime: running
    eks.amazonaws.com/capacityType: ON_DEMAND
    eks.amazonaws.com/nodegroup: sysbox-s-ondemand-us-east-2a-subnet-0af5ea0170708e442
    eks.amazonaws.com/nodegroup-image: ami-014e19779edee1ec8
    eks.amazonaws.com/sourceLaunchTemplateId: lt-0c1499b2d5f2eaaf9
    eks.amazonaws.com/sourceLaunchTemplateVersion: "1"
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2a
    k8s.io/cloud-provider-aws: ef3abbe0ed918e0ae9e2963a2e9cf3d9
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-24-88
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: t3a.large
    plural.sh/capacityType: ON_DEMAND
    plural.sh/performanceType: SUSTAINED
    plural.sh/scalingGroup: sysbox-s-ondemand
    sysbox-install: "yes"
    sysbox-runtime: running
    topology.ebs.csi.aws.com/zone: us-east-2a
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2a
  name: ip-10-0-24-88.us-east-2.compute.internal
spec:
  taints:
  - effect: NoSchedule
    key: plural.sh/sysbox
    value: "true"
  - effect: NoSchedule
    key: plural.sh/capacityType
    value: ON_DEMAND

Additional labels and taints are applied by us or EKS managed service.

The only difference I can think of that between the different nodes would be network related, as each node lives in a different subnet, as each is on a different AZ, not sure what would make one unique, though

AMI filter used: ubuntu-eks/k8s_1.23/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-* AMI: ami-014e19779edee1ec8 host info:

root@ip-10-0-40-22:/etc/cni/net.d#  hostnamectl
   Static hostname: ip-10-0-40-22
         Icon name: computer-vm
           Chassis: vm
        Machine ID: ec25eca82adadd232c6e82d4e5767c27
           Boot ID: 75ddabdc8b8d42ffa3a088c379e4ea0a
    Virtualization: kvm
  Operating System: Ubuntu 20.04.6 LTS
            Kernel: Linux 5.15.0-1040-aws
      Architecture: x86-64

Other Services

We operate a complex cluster setup, where we have other services as Daemonsets in place that tolerate all taints and that will run on every node, namely Calico and some aws- or monitoring-related services.

image

rauerhans avatar Aug 03 '23 18:08 rauerhans

got it working on all three nodes again, but again only after pods restart a couple of times image

rauerhans avatar Aug 04 '23 11:08 rauerhans

Okay, from the event reports I'm getting some indication:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_sysbox-deploy-k8s-27n74_kube-system_1b35b35d-9d58-4644-a1c6-ad18c1e922b0_0(edf24494b4bf0864b178b83e1ebca8838316084a3837b8b386fb15fc4b37cb2b): error adding pod kube-system_sysbox-deploy-k8s-27n74 to CNI network "aws-cni": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

This precedes the restarts. The logs of the pods themselves don't show any errors, before they're terminated.

rauerhans avatar Aug 04 '23 11:08 rauerhans

this documents this error in part, I'm installing sysbox immediately after spinning up the nodes, so let me try to wait a couple of seconds or add a job that waits for the aws-node pod to be in running state

https://repost.aws/knowledge-center/eks-failed-create-pod-sandbox

Update: this didn't help, the pods still run into the same Event errors, restart a couple of times and eventually either fail or succeed. Still no discernible root cause.

rauerhans avatar Aug 04 '23 12:08 rauerhans

Side note: not really observing this from your uninstallation guide, the only pods that get interrupted and restarted are the kube-proxy and aws-node.

NOTES:

Make sure to stop all K8s pods deployed with Sysbox prior to uninstalling Sysbox.

The uninstallation will temporarily disrupt all pods on the nodes where CRI-O and Sysbox were installed, for up to 1 minute, as they require the kubelet to restart.

The 'sleep' instructions above ensure that kubelet has a chance to launch and execute the daemonsets before the subsequent step.

rauerhans avatar Aug 04 '23 13:08 rauerhans

@rauerhans, thanks for filling this one up. I don't remember seeing instability issues like the one you describe, so I believe it must be something recently introduced.

The first log you had some concerns about (below), seems to be a bug in this logic, but I suspect that this has nothing to do with the issue at hand.

I believe the issue we need to focus on is related to this output you previously pasted (see below). which seems to be matching the problem described in issue #713.

# systemctl status sysbox-mgr
● sysbox-mgr.service - sysbox-mgr (part of the Sysbox container runtime)
     Loaded: loaded (/lib/systemd/system/sysbox-mgr.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Thu 2023-08-03 12:22:14 UTC; 5h 49min ago
   Main PID: 19431 (code=killed, signal=TERM)

Aug 03 12:21:29 ip-10-0-50-152 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Starting ..."
Aug 03 12:21:29 ip-10-0-50-152 sysbox-mgr[19431]: time="2023-08-03 12:21:29" level=info msg="Sysbox data root: /var/lib/sysbox"
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: start operation timed out. Terminating.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: sysbox-mgr.service: Failed with result 'timeout'.
Aug 03 12:22:14 ip-10-0-50-152 systemd[1]: Failed to start sysbox-mgr (part of the Sysbox container runtime).

Please take a look at issue #713 and try to manually remove any sysbox-mgr.pid file that you may encounter in the affected node. If the corresponding k8s node comes up after this step, it would prove that this is the smoking-gun that we are after.

rodnymolina avatar Aug 04 '23 23:08 rodnymolina