Bottlerocket node intermittently fails to start with "[FAILED] Failed to start Wait for Network to be Configured."
Hey folks,
We've been running our EKS cluster with Bottlerocket nodes provisioned by Karpenter for a while now, with no issue whatsoever - it's been rock solid :)
However, last night we hit an issue with Pods staying Pending for too long. We investigated and realised Karpenter correctly created a NodeClaim and started up an EC2 instance, but this new instance failed to boot and join the cluster. After 15 minutes, Karpenter realised something was wrong, deleted that node and tried again with a new node.
We thought it was only a transient issue at first, but it happened again this morning, at which point I grabbed an instance and enabled termination protection to keep it alive for debugging.
The startup logs show:
UEFI firmware (version built at 09:00:00 on Nov 1 2018)
Booting `Bottlerocket OS 1.18.0'
Welcome to Bottlerocket OS 1.18.0 (aws-k8s-1.28)!
[ OK ] Created slice Slice /system/modprobe.
[ OK ] Reached target Path Units.
[ OK ] Reached target Slice Units.
[ OK ] Reached target Swaps.
[ OK ] Listening on Journal Audit Socket.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on Network Service Netlink Socket.
[ OK ] Listening on udev Control Socket.
[ OK ] Listening on udev Kernel Socket.
Mounting Huge Pages File System...
Mounting POSIX Message Queue File System...
Mounting CNI Configuration Directory (/etc/cni)...
Mounting Kernel Debug File System...
Mounting Kernel Trace File System...
Mounting Temporary Directory /tmp...
Starting Load audit rules...
Starting Checks and marks if boot has ever succeeded before...
Starting Create List of Static Device Nodes...
Starting Load Kernel Module configfs...
Starting Load Kernel Module efi_pstore...
Starting Load Kernel Module fuse...
Starting Copy SELinux policy files...
Starting Journal Service...
Starting Load Kernel Modules...
Starting Generate network units from Kernel command line...
Starting Remount Root and Kernel File Systems...
Starting Coldplug All udev Devices...
[ OK ] Finished Create List of Static Device Nodes.
[ OK ] Finished Load audit rules.
[ OK ] Finished Load Kernel Module configfs.
[ OK ] Finished Load Kernel Module efi_pstore.
Mounting Kernel Configuration File System...
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Mounted CNI Configuration Directory (/etc/cni).
[ OK ] Mounted Kernel Debug File System.
[ OK ] Mounted Kernel Trace File System.
[ OK ] Mounted Temporary Directory /tmp.
[ OK ] Mounted Kernel Configuration File System.
[ OK ] Finished Load Kernel Module fuse.
Mounting FUSE Control File System...
[ OK ] Mounted FUSE Control File System.
[ OK ] Finished Generate network units from Kernel command line.
[ OK ] Finished Remount Root and Kernel File Systems.
Starting Create System Users...
[ OK ] Started Journal Service.
[ OK ] Finished Checks and marks if boot has ever succeeded before.
[ OK ] Finished Create System Users.
Starting Create Static Device Nodes in /dev...
[ OK ] Finished Copy SELinux policy files.
Mounting Containerd Configuration Directory (/etc/containerd)...
Mounting Host containers Configurat…irectory (/etc/host-containers)...
Mounting Kubernetes PKI private dir…y (/etc/kubernetes/pki/private)...
Mounting AWS configuration directory (/root/.aws)...
Mounting Ephemeral netdog configuration directory...
[ OK ] Mounted Containerd Configuration Directory (/etc/containerd).
[ OK ] Mounted Host containers Configurati… Directory (/etc/host-containers).
[ OK ] Mounted Kubernetes PKI private dire…ory (/etc/kubernetes/pki/private).
[ OK ] Mounted AWS configuration directory (/root/.aws).
[ OK ] Mounted Ephemeral netdog configuration directory.
[ OK ] Finished Create Static Device Nodes in /dev.
[ OK ] Reached target Preparation for Local File Systems.
Starting Rule-based Manager for Device Events and Files...
[ OK ] Finished Load Kernel Modules.
Starting Apply Kernel Variables...
[ OK ] Finished Coldplug All udev Devices.
[ OK ] Finished Apply Kernel Variables.
[ OK ] Started Rule-based Manager for Device Events and Files.
[ OK ] Found device Amazon Elastic Block Store 13.
[ OK ] Found device Amazon Elastic Block Store BOTTLEROCKET-PRIVATE.
[ OK ] Found device Amazon Elastic Block Store 1.
Starting Repart preferred data partition...
[ OK ] Finished Repart preferred data partition.
[ OK ] Found device Amazon Elastic Block Store BOTTLEROCKET-DATA.
Starting Prepare Local Filesystem (/local)...
[ OK ] Stopped Repart preferred data partition.
[ OK ] Finished Prepare Local Filesystem (/local).
Mounting Local Directory (/local)...
[ OK ] Mounted Local Directory (/local).
Mounting Mnt Directory (/mnt)...
Mounting Opt Directory (/opt)...
Mounting Var Directory (/var)...
Starting Resize Data Partition...
[ OK ] Mounted Mnt Directory (/mnt).
[ OK ] Mounted Opt Directory (/opt).
[ OK ] Mounted Var Directory (/var).
Mounting Private Directory (/var/lib/bottlerocket)...
Starting Mask Local Mnt Directory (/local/mnt)...
Starting Mask Local Opt Directory (/local/opt)...
Starting Mask Local Var Directory (/local/var)...
Starting Prepare Opt Directory (/opt)...
Starting Prepare Containerd Directory (/var/lib/containerd)...
Starting Prepare Kubelet Directory (/var/lib/kubelet)...
Starting Prepare Var Directory (/var)...
Starting Flush Journal to Persistent Storage...
Starting Load/Save Random Seed...
[ OK ] Mounted Private Directory (/var/lib/bottlerocket).
[ OK ] Finished Mask Local Mnt Directory (/local/mnt).
[ OK ] Finished Mask Local Opt Directory (/local/opt).
[ OK ] Finished Mask Local Var Directory (/local/var).
[ OK ] Finished Load/Save Random Seed.
[ OK ] Finished Resize Data Partition.
[ OK ] Finished Prepare Opt Directory (/opt).
[ OK ] Finished Prepare Containerd Directory (/var/lib/containerd).
[ OK ] Finished Prepare Kubelet Directory (/var/lib/kubelet).
[ OK ] Reached target First Boot Complete.
[ OK ] Finished Flush Journal to Persistent Storage.
[ OK ] Finished Prepare Var Directory (/var).
Mounting Kernel Modules (Read-Write)...
Mounting CNI Plugin Directory (/opt/cni/bin)...
[ OK ] Mounted Kernel Modules (Read-Write).
[ OK ] Mounted CNI Plugin Directory (/opt/cni/bin).
[ OK ] Reached target Local File Systems.
Mounting License files...
Mounting Kernel Development Sources (Read-Only)...
Starting Commit a transient machine-id on disk...
Starting Create Volatile Files and Directories...
[ OK ] Mounted License files.
[ OK ] Mounted Kernel Development Sources (Read-Only).
Mounting Kernel Development Sources (Read-Write)...
[ OK ] Mounted Kernel Development Sources (Read-Write).
[ OK ] Finished Commit a transient machine-id on disk.
[ OK ] Finished Create Volatile Files and Directories.
Starting Rebuild Dynamic Linker Cache...
Starting Rebuild Journal Catalog...
Starting Network Name Resolution...
[ OK ] Finished Rebuild Journal Catalog.
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Finished Rebuild Dynamic Linker Cache.
Starting Update is Completed...
[ OK ] Finished Update is Completed.
[ OK ] Reached target System Initialization.
[ OK ] Started Scheduled Metricdog Pings.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timer Units.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Socket Units.
Starting D-Bus System Message Bus...
[ OK ] Started D-Bus System Message Bus.
[ OK ] Reached target Basic System.
Starting ACPI event daemon...
Starting Generate network configuration...
Starting Call signpost to mark the …r all required targets are met....
Starting Bottlerocket data store migrator...
[ OK ] Started ACPI event daemon.
[ OK ] Finished Call signpost to mark the …ter all required targets are met..
Starting Prepare Boot Directory (/boot)...
[ OK ] Finished Generate network configuration.
[ OK ] Reached target Preparation for Network.
Starting Network Configuration...
[ OK ] Finished Prepare Boot Directory (/boot).
Starting Disable kexec load syscalls...
[ OK ] Finished Disable kexec load syscalls.
[ 3.087466] migrator[2414]: Data store does not exist at given path, exiting (/var/lib/bottlerocket/datastore/current)
[ OK ] Finished Bottlerocket data store migrator.
Starting Datastore creator...
[ OK ] Finished Datastore creator.
Starting Bottlerocket API server...
[ OK ] Started Network Configuration.
[ OK ] Reached target Network.
Starting Wait for Network to be Configured...
[ OK ] Started Bottlerocket API server.
[FAILED] Failed to start Wait for Network to be Configured.
See 'systemctl status systemd-networkd-wait-online.service' for details.
Starting Write network status...
[ OK ] Finished Write network status.
[ OK ] Reached target Network is Online.
Starting Bottlerocket userdata configuration system...
The main difference I can see is the [FAILED] Failed to start Wait for Network to be Configured. log line, which doesn't appear on a healthy instance.
I'm unsure how to debug this further. I'm currently trying to set up an Instance Connect endpoint in that subnet, but I'm not sure it'll work given the instance doesn't seem to have working networking. I can keep the instance around for a few days if there's any debugging I can run to help pinpoint the root cause here :)
Image I'm using: ami-07d1105485cff7781 (bottlerocket-aws-k8s-1.28-aarch64-v1.18.0-7452c37e)
EKS 1.28, Karpenter 0.32
What I expected to happen: the instance starts up and joins the EKS cluster.
What actually happened: the instance fails to start up and doesn't join the cluster.
How to reproduce the problem: I can't reproduce this reliably. It started happening on 2024-01-22 around 10pm UTC and has happened 7 times since, but we've also reliably provisioned hundreds of similar instances in the same subnets with the same AMI in this timeframe, so it's only intermittent.
Update: the Instance Connect endpoint didn't work out.
Also, I didn't mention it in the original report, but in case it helps: we're provisioning m7g.8xlarge instances.
I've managed to get some logs 🥳 I took snapshots from the EBS volumes of the broken instance, recreated volumes from them, attached them to a new instance, and we're off to the races. 👍
The systemd unit that failed isn't very helpful log-wise:
[ec2-user@ip-10-0-155-70 vol1]$ sudo journalctl --root /vol1/ -u systemd-networkd-wait-online.service
Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured...
Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: Timeout occurred while waiting for network connectivity.
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Jan 23 10:49:53 localhost systemd[1]: Failed to start Wait for Network to be Configured.
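For anyone repeating this with their own recovered volumes, a minimal sketch for pulling the wait-online messages out of a saved journalctl dump (the embedded dump string is an abbreviated sample of the excerpt above, and the regex is my own helper, not anything Bottlerocket ships):

```python
import re

# Abbreviated sample of the saved journalctl dump (same format as the excerpt above).
dump = """\
Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured...
Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: Timeout occurred while waiting for network connectivity.
Jan 23 10:49:53 localhost systemd[1]: Failed to start Wait for Network to be Configured.
"""

# Pull out only the messages logged by systemd-networkd-wait-online itself.
pattern = re.compile(r"systemd-networkd-wait-online\[\d+\]: (.*)")
hits = [m.group(1) for line in dump.splitlines() if (m := pattern.search(line))]
print(hits)  # → ['Timeout occurred while waiting for network connectivity.']
```

The same scan over the full dump would also surface any other units that logged around the two-minute timeout window.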
I'm attaching the full journalctl output from boot: journalctl.log
Hello @Pluies, thanks for cutting this issue! These failures are interesting and we are going to take a look. Thanks for attaching the journalctl.log, which has some useful errors to dive into:
Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, qid 2, index 13. 127850 msecs since last interrupt, 127850 msecs since last napi execution, napi scheduled: 0
We'll update the issue when we know more. In the meantime, can you confirm how many instances you have seen this happen on and when it first happened? Did you ever see this type of failure on previous Bottlerocket releases?
Hi @yeazelm , thanks for having a look! :)
In the meantime, can you confirm how many instances you have seen this happen on and when it first happened?
Sure! This has happened 7 times in the past 24hrs:
{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-qshsb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-6n8nn","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T09:26:41.677Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-4k6tm","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T04:36:35.005Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-rsdc7","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T02:15:55.465Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bbnpz","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T00:28:45.271Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bfllb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-22T21:38:18.128Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-jzzzh","nodepool":"trino","ttl":"15m0s"}
In the same timespan we provisioned 1028 machines.
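To put a number on it, a small sketch that counts the registration-ttl terminations in a Karpenter log dump and works out the failure rate (the first two lines are abbreviated from the real logs above; the third "launched nodeclaim" line is a hypothetical unrelated entry for contrast):

```python
import json

# Abbreviated sample of Karpenter's JSON log lines.
lines = [
    '{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","message":"terminating due to registration ttl","nodeclaim":"trino-qshsb","ttl":"15m0s"}',
    '{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","message":"terminating due to registration ttl","nodeclaim":"trino-6n8nn","ttl":"15m0s"}',
    '{"level":"INFO","time":"2024-01-23T10:40:00.000Z","message":"launched nodeclaim","nodeclaim":"trino-aaaaa"}',  # hypothetical
]

# Keep only the registration-ttl terminations.
failures = [
    rec for rec in map(json.loads, lines)
    if rec.get("message") == "terminating due to registration ttl"
]
print(len(failures))  # → 2 in this sample

# Over the full 24h window: 7 failures out of 1028 provisioned machines.
rate = 7 / 1028
print(f"{rate:.2%}")  # → 0.68%
```

So roughly 0.7% of launches hit this in the worst 24 hours, which matches "intermittent but too frequent to ignore".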
Did you ever see this type of failure on previous Bottlerocket releases?
Looking back, similar errors have happened in the past at much lower rates, and we didn't notice them. I can't confirm whether it was the exact same issue, but we've had the same terminating due to registration ttl Karpenter message:
- Once on December 29th
- Once on December 12th
- 4 times on December 2nd
That's it for the Oct 23rd - Jan 23rd period (we keep 90 days of logs).
Thanks for the quick update @Pluies ! That frequency is super helpful in trying to find what might be the cause.
I'm not sure whether this is just a result of the above or a contributing factor, but we're seeing the following error message in the AWS dashboard:
Instance status checks
Instance reachability check failed
Check failure at 2024/01/23 11:50 GMT+1 (1 day)
I'm now going to kill the node above (it's not cheap 😅), which should be fine given we have the volumes + logs.
The issue hasn't reappeared for the past 24h.
@Pluies can you please tell me the region you are using?
@shaharitzko we're running in eu-west-1 (Ireland), this specific instance was in euw1-az3
@Pluies Could you cut a ticket to AWS support and include the instance ids of the failures you know about and refer to this issue? I think that might help us dig in more.
Hi @yeazelm , I just opened a ticket with the machine IDs from yesterday, hopefully this will shed some light on the issue!
Update from AWS:
Just as a precaution I have opened an internal investigation into the Bottlerocket AMI related network error with the EKS service team and they are looking into a possible kernel bug. Please let me know if you see any recurrences of the same problem.
We've not seen this happen again since Jan 26th fwiw, I'll update this thread with results of the ticket above. 👍
Just a quick update. I did hear from that internal investigation that they are still investigating, but I wanted to confirm with @Pluies: are you still experiencing this, or did it stop after Jan 26th?
The issue happens much less often, but we have seen it again on:
- 2024-02-09T00:38:09Z
- 2024-02-18T23:04:09Z
- 2024-02-25T09:39:13Z
- 2024-02-25T13:53:08Z