
Hetzner worker nodes losing IPv4 with Ubuntu 24.04

Open 7oku opened this issue 9 months ago • 2 comments

Similar to kubermatic/machine-controller#1587, we were recently hit by two more complete outages of our Hetzner Cloud worker nodes. We are on KubeOne 1.9.0 with OSM 1.6.0.

I found out exactly what happens, and it starts as early as the deployment of the worker:

  1. Hetzner deploys a machine with /etc/netplan/50-cloud-init.yaml present
  2. On first boot, cloud-init invokes netplan generate, which creates /run/systemd/network/10-netplan-eth0.network and 10-netplan-eth0.link, and systemd configures eth0 accordingly
  3. OSM runs its bootstrap script, which deletes /etc/netplan/50-cloud-init.yaml and disables cloud-init. Remember: the file is gone now!
  4. The machine is rebooted and Ubuntu 24.04 runs /etc/systemd/system-generators/netplan early in the boot process, which essentially invokes netplan generate again. Since /etc/netplan/50-cloud-init.yaml no longer exists, this also wipes /run/systemd/network/. I'm not sure whether this happens on Ubuntu <= 22.04 as well
  5. The node proceeds to join cluster and everything is fine

As long as networking is not restarted, systemd will continue to manage eth0, i.e. handle DHCP, link state and so on. But since Ubuntu runs unattended upgrades by default, sooner or later a package upgrade will invoke systemctl restart systemd-networkd. From that point on, systemd no longer manages eth0, because the files were wiped in step 4. This is still not an issue as long as eth0 stays up.

On top of that, Hetzner has recently had some weird network quirks. Links go down from time to time:

Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Link DOWN
Feb 22 23:15:03 cluster-pool1-68546f7bb8-psz6n systemd-networkd[1800530]: lxc_health: Lost carrier

That's when everything goes south. The link is no longer managed and will not be brought up again, so the node is gone. Control planes and all other instances survive this by recovering the link; only the worker nodes do not.
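To see whether a node is already in this state, the SETUP column of networkctl list can be inspected. The following is a minimal sketch: setup_state is a hypothetical helper name, and it assumes the usual column order of networkctl list (IDX, LINK, TYPE, OPERATIONAL, SETUP).

```shell
#!/bin/sh
# Hypothetical helper: print the SETUP column for one interface,
# reading `networkctl list` output from stdin.
# Assumes the column order IDX LINK TYPE OPERATIONAL SETUP.
setup_state() {
    awk -v ifname="$1" '$2 == ifname { print $5 }'
}

# Usage on a node (eth0 is the interface named in this report):
#   networkctl list --no-legend --no-pager | setup_state eth0
```

A value of unmanaged for eth0 would suggest that systemd-networkd no longer has a matching .network file for the interface.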

Possible solutions are:

a) Do not delete /etc/netplan/50-cloud-init.yaml. Since cloud-init is disabled afterwards anyway, I see no problem in leaving it there. That said, I don't know your reason for removing it in the first place. Keeping the file would prevent any netplan generate run from wiping out the network config.

b) Disable the systemd generator, so it does not run netplan generate on boot. I'm not sure whether this causes any other issues later on.

ln -s /dev/null /etc/systemd/system-generators/netplan
systemctl daemon-reload

c) Disable unattended upgrades to prevent networking restarts. But I'd rather keep them and have a stable network config.
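For nodes where the original file is already gone, option a) could be approximated by putting some netplan file back, so that netplan generate has something to render. This is only a sketch under assumptions: the file name 60-fallback.yaml and the helper write_fallback_netplan are hypothetical, the content is not Hetzner's original 50-cloud-init.yaml, and it covers IPv4 via DHCP on eth0 only.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal netplan file below an optional root
# directory. Assumes eth0 gets its IPv4 address via DHCP; it does not
# reproduce any other settings from the original cloud-init file.
write_fallback_netplan() {
    root="${1:-}"
    target="$root/etc/netplan/60-fallback.yaml"
    mkdir -p "$root/etc/netplan"
    cat > "$target" <<'EOF'
network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      dhcp4: true
EOF
    # Keep the file readable by its owner only.
    chmod 600 "$target"
}

# Usage on a node (as root):
#   write_fallback_netplan && netplan generate
```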

Please check your worker nodes for the existence of files in /run/systemd/network. If the directory is empty, you are most likely prone to outages.
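The check can be scripted roughly as follows. This is a minimal sketch: has_networkd_config is a hypothetical helper name, and it only looks for *.network files in the given directory.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the directory contains at least one
# *.network file, 1 otherwise. Defaults to /run/systemd/network.
has_networkd_config() {
    dir="${1:-/run/systemd/network}"
    for f in "$dir"/*.network; do
        # The glob stays unexpanded when nothing matches, so test existence.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage on a node:
#   has_networkd_config || echo "no generated networkd config found"
```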

@xrstf Is it a coincidence that you modified the title of the rotten kubermatic/machine-controller#1587 exactly at the time of our first outage? Might you have noticed something similar?

7oku avatar Feb 25 '25 07:02 7oku

@xrstf Is it a coincidence that you modified the title of the rotten https://github.com/kubermatic/machine-controller/issues/1587 exactly at the time of our first outage? Might you have noticed something similar?

I saw the ticket when looking through some board and its typo irked me for a long time. It's pure coincidence I randomly edited it recently :) Even if it wasn't, I would not publicly admit to my superpowers of remotely removing IPs from other people's servers.

xrstf avatar Feb 25 '25 08:02 xrstf

@xrstf I couldn't find the ticket during the outage because I was searching for "loses". I noticed shortly afterwards that you had changed the title and thought you had experienced something similar and changed it for that reason. I didn't mean to make you responsible for our outage :)

7oku avatar Feb 25 '25 09:02 7oku

Internal reference: 8226

csengerszabo avatar Jun 26 '25 07:06 csengerszabo

@csengerszabo @adoi do we have any updates on this ticket?

ahmedwaleedmalik avatar Jul 28 '25 09:07 ahmedwaleedmalik

I had a similar issue and resolved it like this: I set up servers on Hetzner in a private network using cloud-config, which creates the netplan configuration file shown below. I was experiencing network disconnections after running apt upgrade, along with extremely slow boot times due to network timeouts. After adding renderer: networkd and dhcp4: true, the issues were resolved and the network is now working fine.

#cloud-config
write_files:
  - content: |
      network:
        version: 2
        renderer: networkd # <------ 1. this was missing
        ethernets:
          enp7s0:
            dhcp4: true  # <------ 2. this was missing
            routes:
              - to: default
                via: 10.42.0.1
            nameservers:
              addresses: [10.42.0.2]
    path: /etc/netplan/01-mynetwork.yaml

runcmd:
  # apply netplan file
  - [ netplan, generate ]
  - [ netplan, apply ]

stebi avatar Jul 28 '25 10:07 stebi