gce: multi-NIC non-GoogleCloud VM initialization stuck iterating over invalid NICs
Bug report
Multi-NIC VMs, using GCE metadata flavor, but not running on the Google platform, get stuck in a loop trying to set up DHCP on invalid NICs. #4163 introduced a loop over candidate net interfaces, which tries to find a NIC with access to metadata.
In our VMs canidate NICs are: eth0, ib0, ib1, … ib7. And the first interface is luckily the one with the access to metadata. After a successful network setup on eth0, reading the metadata "fails" with the platform check, which seems like a valid enough result to break out of the loop. However, currently this result is treated as an error, and we keep iterating over the rest of the interfaces. Each iteration times out after 5 minutes trying to setup DHCP, resulting in a VM boot time increase by 40 minutes total.
We could probably work around this by disabling GCE network activation entirely. But this seems like an easy fix:
--- a/cloudinit/sources/DataSourceGCE.py
+++ b/cloudinit/sources/DataSourceGCE.py
@@ -123,7 +123,7 @@ class DataSourceGCE(sources.DataSource):
continue
except NoDHCPLeaseError:
continue
- if ret["success"]:
+ if ret["success"] or not ret["platform_reports_gce"]:
self._fallback_interface = candidate_nic
LOG.debug("Primary NIC found: %s.", candidate_nic)
break
What do you think @aciba90 , @blackboxsw ?
Steps to reproduce the problem
I assume this should be easily reproduced by setting platform_reports_gce = False in a GCE multi-nic unit test.
Environment details
- Cloud-init version: 23.3, 23.4
- Operating System Distribution: Ubuntu 22.04
- Cloud provider, platform or installer type: Nebius, QEMU
cloud-init logs
2023-11-08 16:34:29,350 - DataSourceGCE.py[DEBUG]: Looking for the primary NIC in: ['eth0', 'ib0', 'ib1', 'ib2', 'ib3', 'ib4', 'ib5', 'ib6', 'ib7']
2023-11-08 16:34:29,350 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:34:29,350 - dhcp.py[DEBUG]: Performing a dhcp discovery on eth0
2023-11-08 16:34:29,350 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,376 - util.py[DEBUG]: Reading from /sys/class/net/eth0/type (quiet=False)
2023-11-08 16:34:29,376 - util.py[DEBUG]: Read 2 bytes from /sys/class/net/eth0/type
2023-11-08 16:34:29,376 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,430 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:34:29,430 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:34:29,430 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:34:29,430 - util.py[DEBUG]: Reading from /proc/2971/stat (quiet=True)
2023-11-08 16:34:29,430 - util.py[DEBUG]: Read 312 bytes from /proc/2971/stat
2023-11-08 16:34:29,430 - dhcp.py[DEBUG]: killing dhclient with pid=2971
2023-11-08 16:34:29,431 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:34:29,431 - util.py[DEBUG]: Read 596 bytes from /run/dhclient.lease
2023-11-08 16:34:29,431 - ephemeral.py[DEBUG]: Received dhcp lease on eth0 for 10.2.0.32/255.255.0.0
2023-11-08 16:34:29,431 - ephemeral.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 10.2.0.32/16 brd 10.2.255.255
2023-11-08 16:34:29,431 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '10.2.0.32/16', 'broadcast', '10.2.255.255', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,432 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,433 - subp.py[DEBUG]: Running command ['ip', 'route', 'show', '0.0.0.0/0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,434 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '10.2.0.1', 'dev', 'eth0', 'src', '10.2.0.32'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,435 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '10.2.0.1', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,436 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name
2023-11-08 16:34:29,436 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_serial
2023-11-08 16:34:29,436 - DataSourceGCE.py[DEBUG]: Not running on google cloud. product-name=xeon-platinum-8468 serial=NC-dp78kgibid5ktg3mncos
2023-11-08 16:34:29,436 - util.py[DEBUG]: Crawl of GCE metadata service took 0.000 seconds
2023-11-08 16:34:29,436 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', 'default', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,437 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '10.2.0.1', 'dev', 'eth0', 'src', '10.2.0.32'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,438 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'down'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,440 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'del', '10.2.0.32/16', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,441 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:34:29,441 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib0
2023-11-08 16:34:29,441 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:34:29,526 - util.py[DEBUG]: Reading from /sys/class/net/ib0/type (quiet=False)
2023-11-08 16:34:29,526 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib0/type
2023-11-08 16:34:29,526 - util.py[DEBUG]: Reading from /sys/class/net/ib0/address (quiet=False)
2023-11-08 16:34:29,526 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib0/address
2023-11-08 16:34:29,526 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib0-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:34:29,527 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib0-dhclient.conf', 'ib0'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:39:30,432 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:39:30,432 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:39:30,432 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:39:30,432 - util.py[DEBUG]: Reading from /proc/2989/stat (quiet=True)
2023-11-08 16:39:30,432 - util.py[DEBUG]: Read 311 bytes from /proc/2989/stat
2023-11-08 16:39:30,432 - dhcp.py[DEBUG]: killing dhclient with pid=2989
2023-11-08 16:39:30,432 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:39:30,432 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 16:39:30,432 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:39:30,432 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib1
2023-11-08 16:39:30,432 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib1', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:39:30,517 - util.py[DEBUG]: Reading from /sys/class/net/ib1/type (quiet=False)
2023-11-08 16:39:30,517 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib1/type
2023-11-08 16:39:30,517 - util.py[DEBUG]: Reading from /sys/class/net/ib1/address (quiet=False)
2023-11-08 16:39:30,517 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib1/address
2023-11-08 16:39:30,517 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib1-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:39:30,517 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib1-dhclient.conf', 'ib1'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:44:31,180 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:44:31,180 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:44:31,180 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:44:31,180 - util.py[DEBUG]: Reading from /proc/3021/stat (quiet=True)
2023-11-08 16:44:31,180 - util.py[DEBUG]: Read 311 bytes from /proc/3021/stat
2023-11-08 16:44:31,181 - dhcp.py[DEBUG]: killing dhclient with pid=3021
2023-11-08 16:44:31,181 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:44:31,181 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 16:44:31,181 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:44:31,181 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib2
2023-11-08 16:44:31,181 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib2', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:44:31,267 - util.py[DEBUG]: Reading from /sys/class/net/ib2/type (quiet=False)
2023-11-08 16:44:31,267 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib2/type
2023-11-08 16:44:31,267 - util.py[DEBUG]: Reading from /sys/class/net/ib2/address (quiet=False)
2023-11-08 16:44:31,267 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib2/address
2023-11-08 16:44:31,267 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib2-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:44:31,267 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib2-dhclient.conf', 'ib2'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:49:32,315 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:49:32,315 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:49:32,315 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:49:32,315 - util.py[DEBUG]: Reading from /proc/3033/stat (quiet=True)
2023-11-08 16:49:32,315 - util.py[DEBUG]: Read 311 bytes from /proc/3033/stat
2023-11-08 16:49:32,315 - dhcp.py[DEBUG]: killing dhclient with pid=3033
2023-11-08 16:49:32,315 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:49:32,315 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 16:49:32,316 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:49:32,316 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib3
2023-11-08 16:49:32,316 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib3', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:49:32,403 - util.py[DEBUG]: Reading from /sys/class/net/ib3/type (quiet=False)
2023-11-08 16:49:32,403 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib3/type
2023-11-08 16:49:32,403 - util.py[DEBUG]: Reading from /sys/class/net/ib3/address (quiet=False)
2023-11-08 16:49:32,403 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib3/address
2023-11-08 16:49:32,403 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib3-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:49:32,403 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib3-dhclient.conf', 'ib3'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:54:33,748 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:54:33,748 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:54:33,748 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:54:33,748 - util.py[DEBUG]: Reading from /proc/3044/stat (quiet=True)
2023-11-08 16:54:33,748 - util.py[DEBUG]: Read 312 bytes from /proc/3044/stat
2023-11-08 16:54:33,748 - dhcp.py[DEBUG]: killing dhclient with pid=3044
2023-11-08 16:54:33,748 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:54:33,748 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 16:54:33,748 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:54:33,748 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib4
2023-11-08 16:54:33,748 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib4', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:54:33,835 - util.py[DEBUG]: Reading from /sys/class/net/ib4/type (quiet=False)
2023-11-08 16:54:33,835 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib4/type
2023-11-08 16:54:33,835 - util.py[DEBUG]: Reading from /sys/class/net/ib4/address (quiet=False)
2023-11-08 16:54:33,835 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib4/address
2023-11-08 16:54:33,835 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib4-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:54:33,835 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib4-dhclient.conf', 'ib4'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:59:34,387 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 16:59:34,387 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 16:59:34,387 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 16:59:34,387 - util.py[DEBUG]: Reading from /proc/3058/stat (quiet=True)
2023-11-08 16:59:34,387 - util.py[DEBUG]: Read 312 bytes from /proc/3058/stat
2023-11-08 16:59:34,387 - dhcp.py[DEBUG]: killing dhclient with pid=3058
2023-11-08 16:59:34,387 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 16:59:34,387 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 16:59:34,387 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 16:59:34,387 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib5
2023-11-08 16:59:34,388 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib5', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 16:59:34,471 - util.py[DEBUG]: Reading from /sys/class/net/ib5/type (quiet=False)
2023-11-08 16:59:34,471 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib5/type
2023-11-08 16:59:34,471 - util.py[DEBUG]: Reading from /sys/class/net/ib5/address (quiet=False)
2023-11-08 16:59:34,471 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib5/address
2023-11-08 16:59:34,471 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib5-dhclient.conf - wb: [644] 73 bytes
2023-11-08 16:59:34,471 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib5-dhclient.conf', 'ib5'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 17:04:35,450 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 17:04:35,450 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 17:04:35,450 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 17:04:35,450 - util.py[DEBUG]: Reading from /proc/3071/stat (quiet=True)
2023-11-08 17:04:35,450 - util.py[DEBUG]: Read 312 bytes from /proc/3071/stat
2023-11-08 17:04:35,450 - dhcp.py[DEBUG]: killing dhclient with pid=3071
2023-11-08 17:04:35,450 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 17:04:35,450 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 17:04:35,450 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 17:04:35,450 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib6
2023-11-08 17:04:35,450 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib6', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 17:04:35,535 - util.py[DEBUG]: Reading from /sys/class/net/ib6/type (quiet=False)
2023-11-08 17:04:35,535 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib6/type
2023-11-08 17:04:35,535 - util.py[DEBUG]: Reading from /sys/class/net/ib6/address (quiet=False)
2023-11-08 17:04:35,535 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib6/address
2023-11-08 17:04:35,535 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib6-dhclient.conf - wb: [644] 73 bytes
2023-11-08 17:04:35,535 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib6-dhclient.conf', 'ib6'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 17:09:36,738 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 17:09:36,738 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 17:09:36,738 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 17:09:36,738 - util.py[DEBUG]: Reading from /proc/3082/stat (quiet=True)
2023-11-08 17:09:36,738 - util.py[DEBUG]: Read 312 bytes from /proc/3082/stat
2023-11-08 17:09:36,738 - dhcp.py[DEBUG]: killing dhclient with pid=3082
2023-11-08 17:09:36,738 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 17:09:36,738 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 17:09:36,738 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2023-11-08 17:09:36,738 - dhcp.py[DEBUG]: Performing a dhcp discovery on ib7
2023-11-08 17:09:36,739 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'ib7', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 17:09:36,821 - util.py[DEBUG]: Reading from /sys/class/net/ib7/type (quiet=False)
2023-11-08 17:09:36,821 - util.py[DEBUG]: Read 3 bytes from /sys/class/net/ib7/type
2023-11-08 17:09:36,821 - util.py[DEBUG]: Reading from /sys/class/net/ib7/address (quiet=False)
2023-11-08 17:09:36,821 - util.py[DEBUG]: Read 60 bytes from /sys/class/net/ib7/address
2023-11-08 17:09:36,821 - util.py[DEBUG]: Writing to /var/tmp/cloud-init/ib7-dhclient.conf - wb: [644] 73 bytes
2023-11-08 17:09:36,822 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', '-cf', '/var/tmp/cloud-init/ib7-dhclient.conf', 'ib7'] with allowed return codes [0] (shell=False, capture=True)
2023-11-08 17:14:37,014 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2023-11-08 17:14:37,014 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2023-11-08 17:14:37,014 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2023-11-08 17:14:37,014 - util.py[DEBUG]: Reading from /proc/3094/stat (quiet=True)
2023-11-08 17:14:37,014 - util.py[DEBUG]: Read 312 bytes from /proc/3094/stat
2023-11-08 17:14:37,014 - dhcp.py[DEBUG]: killing dhclient with pid=3094
2023-11-08 17:14:37,014 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2023-11-08 17:14:37,014 - util.py[DEBUG]: Read 0 bytes from /run/dhclient.lease
2023-11-08 17:14:37,014 - DataSourceGCE.py[WARNING]: Did not find a fallback interface on gce.
2023-11-08 17:14:37,014 - DataSourceGCE.py[DEBUG]: Not running on GCE.
using GCE metadata flavor, but not running on the Google platform ... Cloud provider, platform or installer type: Nebius, QEMU
Hey that's neat. I didn't know cloud-init was used by Nebius. Is this happening on one of their provided cloud images? Or are you just trying out cloud-init for the first time there?
This cloud is technically not supported yet, so maybe we should create a new datasource to support Nebius.
@dalekenium Are you a user or developer of Nebius?
Was this datasource detected by ds-identify at runtime? Or was this manually configured to use this datasource?
Any clue how their IMDS works? Is this cloud trying to mimic GCE?
I’m a developer from Nebius, but I’m mainly responsible for the kernel/drivers, so I could be wrong about the metadata. That said, yes, I believe all of our GPU prod images use cloud-init. And we emulate both GCE and EC2 for wider compatibility with existing VM images, with manual configuration:
echo 'datasource_list: [ NoCloud, GCE, Ec2, None ]' | tee /etc/cloud/cloud.cfg.d/90_dpkg.cfg
That said, yes, I believe all of our GPU prod images use cloud-init. And we emulate both GCE and EC2 for wider compatibility with existing VM images, with manual configuration: echo 'datasource_list: [ NoCloud, GCE, Ec2, None ]' | tee /etc/cloud/cloud.cfg.d/90_dpkg.cfg
I guess what you want to avoid is cloud-init re-running and re-creating ssh keys for their transferred instances.
Without https://github.com/canonical/cloud-init/pull/4163, the behavior you had was:
- Your platform does not have Google's dmi and
ret["platform_reports_gce"]is set to False - read_md was executed with eth0 as fallback_nic and exited early without fetching the meta-data nor user-data.
- _get_data returns False.
- In a later cloud-init boot-stage the metadata and userdata is fetched.
Right?
The current behavior is the same but adding 5 mins of delay per non-eth0 NICs.
Therefore, applying the suggested patch would only remove the delayed time. While it would solve your problem, I think it is really difficult to understand and maintain, and it could potentially break other setups on GCE.
Additionally, if more deviation between GCE and Nebius arises or other non-GCE platforms using this datasource appear, then the difficulty to maintain this would increase exponentially.
I think it would be better to either, create a DataSourceNebius and clearly expressing the requirements of your cloud, or solving it by other means.
Could you please share an example /run/cloud/ds-identify.log?
echo 'datasource_list: [ NoCloud, GCE, Ec2, None ]' | tee /etc/cloud/cloud.cfg.d/90_dpkg.cfg
Curious, why NoCloud first? Also, why bother with the Ec2 entry, since presumably GCE should always run?
applying the suggested patch would only remove the delayed time. While it would solve your problem, I think it is really difficult to understand and maintain, and it could potentially break other setups on GCE.
+1
I think it would be better to either, create a DataSourceNebius and clearly expressing the requirements of your cloud, or solving it by other means.
Since you attempt to emulate GCE and EC2, it should be pretty easy to add a new datasource definition based on one of theirs (which we can patch as you suggest, or customize to better suit your needs). The "uniquely identify the cloud" would be the important requirement to get started on that. The cloud-init code for that lives in ds-identify.
Sorry for the delay, attaching the ds-identify.log. I also did some more digging into the issue.
- In regards to the NoCloud - I found it being used in a packer run, that generates our custom VM images from vanilla Ubuntu (with a cloud-init clean at the end). However I don’t think it’s actually being used after that, in prod VMs
- With GCE and EC2 - by default we emulate both, and even give users toggles to choose between them, when creating a VM. There’s even a helpful note, clarifying that EC2 (IMDSv1) has several security gaps, thus GCE is a better option. Docs for our metadata.
Why we went with that instead of a new custom datasource is a good question, we might want to reconsider in the future. The main reason to avoid a new datasource seems to be compatibility with existing metadata parsers (including, but perhaps not limited to cloud-init).
That said, you’re right, that our problem is not with the NIC timeouts, but rather with the fact that cloud-init GCE doesn’t seem to work at all with our platform. For example, if the user explicitly disables EC2 toggle, his metadata simply doesn’t reach the VM due to the failed platform check.
I see that EC2 has multiple registered platforms, and even a helpful option “strict_id = false” (which we actually also use, since our platform isn’t properly registered there either). Is there a reason why we can’t register our platform with GCE? And, while we're at it, with EC2 as well.
Looks like we can be identified by DMI_SYS_VENDOR=Nebius and perhaps DMI_PRODUCT_SERIAL=NCP-*