Networking failures after NIC reordering
This bug was originally filed in Launchpad as LP: #1958280
Launchpad details
affected_projects = ['netplan']
assignee = None
assignee_name = None
date_closed = None
date_created = 2022-01-18T17:19:31.963574+00:00
date_fix_committed = None
date_fix_released = None
id = 1958280
importance = high
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1958280
milestone = None
owner = cjp256
owner_name = Chris Patterson
private = False
status = triaged
submitter = cjp256
submitter_name = Chris Patterson
tags = []
duplicates = []
Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:31.963574+00:00
We can reliably reproduce a case where a network configuration change for an Ubuntu 20.04 VM results in systemd-networkd hanging on "pending" interfaces. The interfaces are pending because the names assigned in the current boot conflict with those found in /etc/netplan/50-cloud-init.yaml from the previous boot.
Specifically, the netplan generator applies the previous configuration's names prior to running cloud-init local. We'll see something like systemd-udevd[228]: eth0: Failed to process device, ignoring: File exists.
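To make the naming conflict concrete, here is a minimal Python sketch of the situation, not code from netplan or cloud-init. It assumes the kernel enumerates the NICs in their new order while the stale netplan config still maps each MAC address to its old name. The MAC addresses and the function name `find_rename_conflicts` are hypothetical.

```python
def find_rename_conflicts(current, desired):
    """Return (current_name, wanted_name) pairs that cannot be renamed cleanly.

    current: kernel name -> MAC address as enumerated on this boot
    desired: MAC address -> set-name taken from the stale netplan config
    """
    conflicts = []
    for name, mac in current.items():
        target = desired.get(mac)
        if target is None or target == name:
            continue  # no rename requested, or already correctly named
        holder = current.get(target)
        if holder is not None and holder != mac:
            # The wanted name is held by another device, which is the
            # situation behind the "File exists" error from systemd-udevd.
            conflicts.append((name, target))
    return conflicts


# Hypothetical MACs: the NICs were swapped, so the kernel order is reversed
# relative to what the previous boot's config expects.
current_boot = {"eth0": "00:0d:3a:00:00:02", "eth1": "00:0d:3a:00:00:01"}
stale_netplan = {"00:0d:3a:00:00:01": "eth0", "00:0d:3a:00:00:02": "eth1"}

print(find_rename_conflicts(current_boot, stale_netplan))
```

With two swapped NICs, each device wants the name the other one currently holds, so both renames conflict.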
In one scenario, the data source is able to fetch updated network configuration, and cloud-init updates the config & udev rules just fine. However, networking stays offline ("pending") indefinitely. It can be forced to resolve by executing sudo udevadm trigger --attr-match=subsystem=net.
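As a rough illustration of how the stuck state could be detected before applying that workaround, the following Python sketch scans `networkctl` list output for links whose SETUP column reads "pending". The sample text is illustrative and not taken from the attached logs, and the helper name `pending_links` is hypothetical.

```python
def pending_links(networkctl_output):
    """Return link names whose SETUP state is 'pending'.

    Expects rows in the form: IDX LINK TYPE OPERATIONAL SETUP
    """
    pending = []
    for line in networkctl_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric index; this skips the header
        # row and any trailing summary line.
        if len(fields) == 5 and fields[0].isdigit() and fields[4] == "pending":
            pending.append(fields[1])
    return pending


# The workaround command quoted in the report above.
RETRIGGER_CMD = ["udevadm", "trigger", "--attr-match=subsystem=net"]

# Illustrative output only; real output will differ per system.
sample = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 eth0 ether    off         pending
  3 eth1 ether    off         pending
"""

links = pending_links(sample)
if links:
    print("pending links:", links)
    print("workaround: sudo " + " ".join(RETRIGGER_CMD))
```

The sketch only prints the command rather than running it, since re-triggering udev requires root and changes the live system.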
Example: Create a VM on Azure with two NICs, re-order them, then restart.
az vm create --name test-x1 --image Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest --nics test-nic-01 test-nic-02
az vm deallocate --name test-x1
az vm nic set --vm-name test-x1 --nics test-nic-02 test-nic-01
az vm start --name test-x1
After doing that, I am unable to log in via the serial console for 20 minutes, until cloud-init times out. In this case, Azure is trying to report ready but cannot because system networking never came up. If we remove /lib/systemd/system/cloud-init-local.service.d/50-azure-clear-persistent-obj-pkl.conf, cloud-init no longer hangs the boot, but networking still fails to initialize for the guest.
The behavior for 18.04 is a bit different. On 18.04, the renaming of the interfaces succeeds at early boot, which instead results in the Azure data source failing the local phase because the fallback_interface is no longer the primary NIC (eth1 secondary was renamed to eth0 to match previous boot's config).
Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:31.963574+00:00
Launchpad attachments: Ubuntu 20.04 nic swap logs
Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:53.411383+00:00
Launchpad attachments: Ubuntu 18.04 nic swap logs
Launchpad user James Falcon(falcojr) wrote on 2022-01-19T22:52:55.104750+00:00
Thanks for the thorough bug report. I have confirmed the 20.04 behavior and the root cause.
Launchpad user James Falcon(falcojr) wrote on 2022-01-21T17:22:08.679657+00:00
Adding netplan here, as cloud-init is generating the netplan config correctly before the network comes up.
@TheRealFalcon feel free to assign this to me
@cjp256 any ideas on how to resolve this one? At first blush, this looks like a netplan issue. I'm not sure how we would solve this in cloud-init that wouldn't just be a workaround.
@holmanb The only option I think will work is dropping set-name usage for Azure datasource in the netplan config. They should be ordered fine during system enumeration (i.e. eth0 is primary until we force it to swap due to config).
My concern is the potential for side effects where set-name may be important. I don't know of such a situation, but it's hard to say there isn't one. Maybe we can add an option to the Azure datasource to toggle the behavior and change the default only for new distro versions to minimize risk?
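A minimal Python sketch of that idea follows, assuming the rendered netplan config is available as a nested dict with `match` and `set-name` keys under each ethernet entry. This is not cloud-init's implementation: the function `drop_set_name`, its `enabled` toggle, and the MAC address are hypothetical and only illustrate the proposed behavior.

```python
import copy


def drop_set_name(netplan_config, enabled=True):
    """Return a copy of the config with 'set-name' removed from each ethernet.

    enabled is a hypothetical toggle standing in for the proposed datasource
    option; when False, the config is returned unchanged.
    """
    if not enabled:
        return netplan_config
    result = copy.deepcopy(netplan_config)
    ethernets = result.get("network", {}).get("ethernets", {})
    for iface in ethernets.values():
        # Keep the MAC match so the settings still bind to the right NIC,
        # but stop forcing a rename to the previous boot's name.
        iface.pop("set-name", None)
    return result


# Hypothetical config in the general shape of 50-cloud-init.yaml.
config = {
    "network": {
        "version": 2,
        "ethernets": {
            "eth0": {
                "dhcp4": True,
                "match": {"macaddress": "00:0d:3a:00:00:01"},
                "set-name": "eth0",
            }
        },
    }
}

print(drop_set_name(config)["network"]["ethernets"]["eth0"])
```

Returning a copy keeps the original config intact, so the toggle can be flipped per distro version without affecting other callers.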