cloud-init intermittent to fail rename network interface
Bug report
On Azurelinux with cloud-init 24.3 we are running test and seeing intermittent issue where cloud-init failed to rename the network interface. I found it is may due to a racing condition between cloud-init and systemd-networkd. Cloud-init renaming command: cloud-init failed to run cmd ['ip', 'link', 'set', 'eth7', 'name', 'eth1'] due to there is still an eth1 exists after cloud-init already renaming eth1 to cirename0. Cloud-init gets eventually errors out due to unexpected two eth1 exist. From the systemd logs I am seeing systemd and cloud-init trying to apply the network config and renaming at same time so I suspect this is how the race condition came along:
Steps to reproduce the problem
Intermittent test failed where multiple nics were populated
Environment details
- Cloud-init version: 24.3
- Operating System Distribution: azurelinux
- Cloud provider, platform or installer type: Azure host
cloud-init logs
2024-11-18 08:15:04,840 - subp.py[DEBUG]: Running command ['ip', '-6', 'addr', 'show', 'permanent', 'scope', 'global'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,842 - subp.py[DEBUG]: Running command ['ip', '-4', 'addr', 'show'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,843 - net[DEBUG]: Detected interfaces {'eth6': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:33:e3', 'name': 'eth6', 'up': False}, 'eth4': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:38:a6', 'name': 'eth4', 'up': False}, 'lo': {'downable': False, 'device_id': None, 'driver': None, 'mac': '00:00:00:00:00:00', 'name': 'lo', 'up': True}, 'eth2': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:3d:44', 'name': 'eth2', 'up': False}, 'eth0': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:38:38', 'name': 'eth0', 'up': False}, 'eth7': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:3d:59', 'name': 'eth7', 'up': False}, 'eth5': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:3b:fe', 'name': 'eth5', 'up': False}, 'eth3': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:3a:c8', 'name': 'eth3', 'up': False}, 'eth1': {'downable': True, 'device_id': '0x3', 'driver': 'hv_netvsc', 'mac': '60:45:bd:ae:3b:81', 'name': 'eth1', 'up': False}} 2024-11-18 08:15:04,844 - net[DEBUG]: Renamed [['60:45:bd:ae:38:38', 'eth0', 'hv_netvsc', '0x3'], ['60:45:bd:ae:3d:59', 'eth1', 'hv_netvsc', '0x3'], ['60:45:bd:ae:3b:fe', 'eth2', 'hv_netvsc', '0x3'], ['60:45:bd:ae:3d:44', 'eth3', 'hv_netvsc', '0x3'], ['60:45:bd:ae:3a:c8', 'eth4', 'hv_netvsc', '0x3'], ['60:45:bd:ae:38:a6', 'eth5', 'hv_netvsc', '0x3'], ['60:45:bd:ae:33:e3', 'eth6', 'hv_netvsc', '0x3'], ['60:45:bd:ae:3b:81', 'eth7', 'hv_netvsc', '0x3']] with ops [('rename', '60:45:bd:ae:3d:59', 'eth1', ('eth1', 'cirename0')), ('rename', '60:45:bd:ae:3d:59', 'eth1', ('eth7', 'eth1')), ('rename', '60:45:bd:ae:3b:fe', 'eth2', ('eth2', 'cirename1')), ('rename', '60:45:bd:ae:3b:fe', 'eth2', ('eth5', 'eth2')), ('rename', '60:45:bd:ae:3d:44', 'eth3', ('eth3', 'cirename2')), ('rename', '60:45:bd:ae:3d:44', 'eth3', ('cirename1', 'eth3')), ('rename', '60:45:bd:ae:3a:c8', 'eth4', ('eth4', 'cirename3')), ('rename', '60:45:bd:ae:3a:c8', 'eth4', ('cirename2', 'eth4')), ('rename', '60:45:bd:ae:38:a6', 'eth5', ('cirename3', 'eth5')), ('rename', '60:45:bd:ae:3b:81', 'eth7', ('cirename0', 'eth7'))] 2024-11-18 08:15:04,844 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth1', 'name', 'cirename0'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,876 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth7', 'name', 'eth1'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,880 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth2', 'name', 'cirename1'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,912 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth5', 'name', 'eth2'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,944 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth3', 'name', 'cirename2'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:04,972 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'cirename1', 'name', 'eth3'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:05,008 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth4', 'name', 'cirename3'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:05,032 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'cirename2', 'name', 'eth4'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:05,060 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'cirename3', 'name', 'eth5'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:05,096 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'cirename0', 'name', 'eth7'] with allowed return codes [0] (shell=False, capture=True) 2024-11-18 08:15:05,098 - stages.py[WARNING]: Failed to rename devices: Failed to apply network config names: [unknown] Error performing rename('eth7', 'eth1') for 60:45:bd:ae:3d:59, eth1: Unexpected error while running command. Command: ['ip', 'link', 'set', 'eth7', 'name', 'eth1'] Exit code: 2 Reason: - Stdout: Stderr: RTNETLINK answers: File exists [unknown] Error performing rename('cirename0', 'eth7') for 60:45:bd:ae:3b:81, eth7: Unexpected error while running command. Command: ['ip', 'link', 'set', 'cirename0', 'name', 'eth7'] Exit code: 2 Reason: - Stdout: Stderr: RTNETLINK answers: File exists cloud-init.log Cloud-init -output.log
The root cause causes the racing condition test failure is that cloud-init supports multiple network renderers and uses netplan by default order. Azurelinux should by default use networkd as first pick for default renderer. Netplan will try to generate network configs and then populate those configs for systemd-networkd to run which eventually racing with cloud-init also utilizing those configs.
The solution is set networked as the first pick network renderer in cloud-init.cfg instead of netplan. This way cloud-init will first try to use networkd for network renderer and we don't have the issue anymore.
@rmhsawyer Interesting, when did azurelinux start shipping with netplan?
We supported netplan long ago but recently caught the issue.