operating-system-manager
Hetzner worker nodes losing IPv4 with Ubuntu 24.04
Similar to kubermatic/machine-controller#1587, we were recently hit again by two complete outages of Hetzner Cloud worker nodes. We are on KubeOne 1.9.0 with OSM 1.6.0.
I found out exactly what happens, and it starts as early as the deployment of the worker:
1. Hetzner deploys a machine with /etc/netplan/50-cloud-init.yaml present.
2. On first boot, cloud-init invokes netplan generate and creates the /run/systemd/network/10-netplan-eth0.network and 10-netplan-eth0.link files, and systemd configures eth0 accordingly.
3. OSM runs the bootstrap script, which deletes /etc/netplan/50-cloud-init.yaml and disables cloud-init. Remember, the file is gone now!
4. The machine is rebooted and Ubuntu 24.04 runs /etc/systemd/system-generators/netplan early in the boot process. This essentially invokes netplan generate again. Since there is no /etc/netplan/50-cloud-init.yaml any more, it also wipes /run/systemd/network/. I'm not sure if this is a thing in Ubuntu <=22.04 as well.
5. The node proceeds to join the cluster and everything is fine.
As long as networking is not restarted, systemd keeps managing eth0, i.e. handling DHCP, link state and so on.
But since Ubuntu runs unattended upgrades by default, sooner or later a package upgrade will invoke systemctl restart systemd-networkd. From that point on, systemd no longer manages eth0, because the files were wiped in step 4. This is still not an issue as long as eth0 stays up.
But recently Hetzner has also had some weird network quirks. Links go down from time to time:
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Link DOWN
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Lost carrier
That's when everything goes south. The link is no longer managed and will not be brought up again, so the node is gone. Control plane nodes and all other instances survive this by recovering the link, just not the worker nodes.
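To see whether a node is already in this state, one option is to look at the SETUP column of networkctl list. The following is only a rough sketch: the helper name link_is_unmanaged is made up for illustration, and it assumes the usual five-column layout (IDX, LINK, TYPE, OPERATIONAL, SETUP) of networkctl list --no-legend.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSM or netplan).
# Reads `networkctl list --no-legend` output on stdin and exits 0 if the
# link named in $1 is listed with SETUP state "unmanaged", 1 otherwise.
# Assumes the columns are: IDX LINK TYPE OPERATIONAL SETUP.
link_is_unmanaged() {
    awk -v link="$1" '
        $2 == link && $5 == "unmanaged" { found = 1 }
        END { exit !found }
    '
}
```

On a worker node it could be used like this: networkctl list --no-legend --no-pager | link_is_unmanaged eth0 && echo "eth0 is not managed by systemd-networkd". Note that virtual links created by the CNI (such as lxc_health) are normally unmanaged as well, so only the primary interface is interesting here.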
Possible solutions are:
a) Do not delete /etc/netplan/50-cloud-init.yaml. Since cloud-init is disabled afterwards anyway, I see no problem in leaving it there, although I don't know your reason for removing it in the first place. With the file still in place, netplan generate runs would no longer wipe out the network config.
b) Disable the systemd generator, so it does not run netplan generate on boot. I'm not sure if this causes any other issues later on.
ln -s /dev/null /etc/systemd/system-generators/netplan
systemctl daemon-reload
c) Disable unattended upgrades to prevent networking restarts. But I'd rather keep them and have a stable network config.
Please check your worker nodes for the existence of files in /run/systemd/network. If the directory is empty, you're most likely prone to outages.
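The check above can be scripted, for example like this. It is a minimal sketch: the function name has_network_units is hypothetical, and it only tests whether at least one *.network file exists in the given directory (defaulting to /run/systemd/network); it does not validate the file contents.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the directory contains at least one
# *.network file, 1 otherwise. Defaults to /run/systemd/network.
has_network_units() {
    dir="${1:-/run/systemd/network}"
    for f in "$dir"/*.network; do
        # If the glob matched nothing, $f is the literal pattern,
        # which does not exist as a file.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

Running has_network_units || echo "no networkd config in /run/systemd/network, node is likely affected" on each worker gives a quick overview of which nodes are at risk.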
@xrstf Is it a coincidence that you modified the title of the rotten kubermatic/machine-controller#1587 exactly at the time of our first outage? Did you perhaps notice something similar?
I saw the ticket when looking through some board and its typo irked me for a long time. It's pure coincidence I randomly edited it recently :) Even if it wasn't, I would not publicly admit to my superpowers of remotely removing IPs from other people's servers.
@xrstf I couldn't find the ticket during the outage because I was searching for "loses" at the time. I just noticed you changed it shortly afterwards and thought you had experienced something similar and changed it for that reason. I didn't mean to make you responsible for our outage :)
Internal reference: 8226
@csengerszabo @adoi do we have any updates on this ticket?
I had a similar issue and resolved it like this: I set up servers on Hetzner in a private network using cloud-config which creates the following netplan configuration file. I was experiencing network disconnections after running apt upgrade, along with extremely slow boot times due to network timeouts. After adding renderer: networkd and dhcp4: true, the issues were resolved. The network is now working fine.
#cloud-config
write_files:
  - content: |
      network:
        version: 2
        renderer: networkd # <------ 1. this was missing
        ethernets:
          enp7s0:
            dhcp4: true # <------ 2. this was missing
            routes:
              - to: default
                via: 10.42.0.1
            nameservers:
              addresses: [10.42.0.2]
    path: /etc/netplan/01-mynetwork.yaml
runcmd:
  # apply netplan file
  - [ netplan, generate ]
  - [ netplan, apply ]