cloud-init icon indicating copy to clipboard operation
cloud-init copied to clipboard

No local datasource found caused by cloud-init set down and delete eth0

Open Jonas-zhang opened this issue 1 year ago • 2 comments

Bug report

I deploy GCE with Terraform. but cloud-init can't find data from SLES15. I think the root cause is cloud-init set down and delete eth0, so cloud-init can't get metadata from GCP. cloud-init can't connect to GCP metadata, so it think there is no datasource in this instance. After I rerun cloud-init manually, the datasource will be found and everything works well.

Steps to reproduce the problem

Environment details

  • Cloud-init version: 23.3-150100.8.82.3
  • Operating System Distribution: SLES15 SP4
  • Cloud provider, platform or installer type: Terraform GCP

cloud-init logs

Partial log in var/log/cloud-init.log.

2024-06-28 07:35:00,463 - handlers.py[DEBUG]: start: init-local/search-GCELocal: searching for local data from DataSourceGCELocal
2024-06-28 07:35:00,463 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'>
2024-06-28 07:35:00,464 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-06-28 07:35:00,464 - __init__.py[DEBUG]: Machine is configured to run on single datasource DataSourceGCELocal.
2024-06-28 07:35:00,465 - util.py[DEBUG]: Reading from /sys/class/net/eth0/name_assign_type (quiet=False)
2024-06-28 07:35:00,465 - __init__.py[DEBUG]: Found unstable nic names: ['eth0']; calling udevadm settle
2024-06-28 07:35:00,465 - subp.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:00,522 - util.py[DEBUG]: Waiting for udev events to settle took 0.057 seconds
2024-06-28 07:35:00,523 - util.py[DEBUG]: Reading from /sys/class/net/eth0/address (quiet=False)
2024-06-28 07:35:00,523 - util.py[DEBUG]: Read 18 bytes from /sys/class/net/eth0/address
2024-06-28 07:35:00,523 - util.py[DEBUG]: Reading from /sys/class/net/eth0/device/device (quiet=False)
2024-06-28 07:35:00,523 - util.py[DEBUG]: Read 7 bytes from /sys/class/net/eth0/device/device
2024-06-28 07:35:00,523 - util.py[DEBUG]: Reading from /sys/class/net/lo/address (quiet=False)
2024-06-28 07:35:00,523 - util.py[DEBUG]: Read 18 bytes from /sys/class/net/lo/address
2024-06-28 07:35:00,523 - util.py[DEBUG]: Reading from /sys/class/net/lo/device/device (quiet=False)
2024-06-28 07:35:00,523 - util.py[DEBUG]: Reading from /sys/class/net/eth0/carrier (quiet=False)
2024-06-28 07:35:00,523 - __init__.py[DEBUG]: Interface has no carrier: eth0
2024-06-28 07:35:00,524 - util.py[DEBUG]: Reading from /sys/class/net/eth0/dormant (quiet=False)
2024-06-28 07:35:00,524 - util.py[DEBUG]: Reading from /sys/class/net/eth0/operstate (quiet=False)
2024-06-28 07:35:00,524 - util.py[DEBUG]: Read 5 bytes from /sys/class/net/eth0/operstate
2024-06-28 07:35:00,524 - DataSourceGCE.py[DEBUG]: Looking for the primary NIC in: ['eth0']
2024-06-28 07:35:00,524 - dhcp.py[DEBUG]: DHCP client selected: dhclient
2024-06-28 07:35:00,524 - dhcp.py[DEBUG]: Performing a dhcp discovery on eth0
2024-06-28 07:35:00,524 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:00,536 - util.py[DEBUG]: Reading from /sys/class/net/eth0/type (quiet=False)
2024-06-28 07:35:00,536 - util.py[DEBUG]: Read 2 bytes from /sys/class/net/eth0/type
2024-06-28 07:35:00,536 - subp.py[DEBUG]: Running command ['/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,584 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
2024-06-28 07:35:05,584 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
2024-06-28 07:35:05,584 - util.py[DEBUG]: Read 5 bytes from /run/dhclient.pid
2024-06-28 07:35:05,584 - util.py[DEBUG]: Reading from /proc/1796/stat (quiet=True)
2024-06-28 07:35:05,584 - util.py[DEBUG]: Read 300 bytes from /proc/1796/stat
2024-06-28 07:35:05,585 - dhcp.py[DEBUG]: killing dhclient with pid=1796
2024-06-28 07:35:05,586 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
2024-06-28 07:35:05,586 - util.py[DEBUG]: Read 807 bytes from /run/dhclient.lease
2024-06-28 07:35:05,586 - ephemeral.py[DEBUG]: Received dhcp lease on eth0 for *****
2024-06-28 07:35:05,586 - ephemeral.py[DEBUG]: Attempting setup of ephemeral network on eth0 with ***** brd *****
2024-06-28 07:35:05,586 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '*****', 'broadcast', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,590 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,593 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,595 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '0.0.0.0/0', 'via', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,598 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name
2024-06-28 07:35:05,603 - util.py[DEBUG]: Resolving URL: http://metadata.google.internal/computeMetadata/v1/ took 0.004 seconds
2024-06-28 07:35:05,603 - DataSourceGCE.py[DEBUG]: http://metadata.google.internal/computeMetadata/v1/ is not resolvable
2024-06-28 07:35:05,603 - util.py[DEBUG]: Crawl of GCE metadata service took 0.005 seconds
2024-06-28 07:35:05,603 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '0.0.0.0/0', 'via', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,606 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,608 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'down'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,611 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'del', '*****', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2024-06-28 07:35:05,614 - DataSourceGCE.py[WARNING]: Did not find a fallback interface on gce.
2024-06-28 07:35:05,614 - DataSourceGCE.py[WARNING]: address "http://metadata.google.internal/computeMetadata/v1/" is not resolvable
2024-06-28 07:35:05,614 - __init__.py[DEBUG]: Datasource DataSourceGCELocal not updated for events: boot-new-instance
2024-06-28 07:35:05,614 - handlers.py[DEBUG]: finish: init-local/search-GCELocal: SUCCESS: no local data found from DataSourceGCELocal
2024-06-28 07:35:05,615 - main.py[DEBUG]: No local datasource found

Jonas-zhang avatar Jul 01 '24 05:07 Jonas-zhang

Thank you @Jonas-zhang for filing this bug to make cloud-init better.

Generally speaking the teardown of the ephemeral network configuration during cloud-init's init-local time frame is not what is causing the inability to detect GCE's instance metadata service(IMDS). What seems to be the problem is the inability of that network configuration to resolve the IMDS seen with this log that you captured.

http://metadata.google.internal/computeMetadata/v1/ is not resolvable

Something in that initial network config is unable to get ahold of DNS for to resolve the .internal domain for some reason. I'm guessing either an intermittent outage for the network config where this instance is being launched, of some invalid network or DNS configuration in the environment.

Is there an ability to get into the system somehow to troubleshoot these IMDS access issues? https://cloud.google.com/compute/docs/troubleshooting/troubleshoot-metadata-server

On most instances, we see the following log noting that cloud-init is able to correctly resolve the IMDS. Example of a test instance torn down below:

2024-07-02 17:43:46,738 - ephemeral.py[DEBUG]: Received dhcp lease on ens4 for 10.128.0.113/255.255.255.255
2024-07-02 17:43:46,738 - ephemeral.py[DEBUG]: Attempting setup of ephemeral network on ens4 with 10.128.0.113/32 brd 10.128.0.113
2024-07-02 17:43:46,738 - subp.py[DEBUG]: Running command ['ip', '--json', 'addr'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,741 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '10.128.0.113/32', 'broadcast', '10.128.0.113', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,743 - ephemeral.py[DEBUG]: Skip bringing up network link: interface ens4 is already up
2024-07-02 17:43:46,743 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '10.128.0.1/32', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,745 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '0.0.0.0/0', 'via', '10.128.0.1', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,748 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name
2024-07-02 17:43:46,750 - util.py[DEBUG]: Resolving URL: http://metadata.google.internal/computeMetadata/v1/ took 0.002 seconds
2024-07-02 17:43:46,750 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/id' with {'url': 'http://metadata.google.internal/computeMetadata/v1/instance/id', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'Metadata-Flavor': 'Google', 'User-Agent': 'Cloud-Init/99.daily-202407012148-40e2eb4ae~ubuntu22.04.1'}} configuration
2024-07-02 17:43:46,758 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/id (200, 19b) after 1 attempts
2024-07-02 17:43:46,758 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/zone' with {'url': 'http://metadata.google.internal/computeMetadata/v1/instance/zone', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'Metadata-Flavor': 'Google', 'User-Agent': 'Cloud-Init/99.daily-202407012148-40e2eb4ae~ubuntu22.04.1'}} configuration
2024-07-02 17:43:46,761 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/zone (200, 41b) after 1 attempts
2024-07-02 17:43:46,761 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/hostname' with {'url': 'http://metadata.google.internal/computeMetadata/v1/instance/hostname', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'Metadata-Flavor': 'Google', 'User-Agent': 'Cloud-Init/99.daily-202407012148-40e2eb4ae~ubuntu22.04.1'}} configuration
2024-07-02 17:43:46,764 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/hostname (200, 74b) after 1 attempts
2024-07-02 17:43:46,764 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/attributes/?recursive=True' with {'url': 'http://metadata.google.internal/computeMetadata/v1/instance/attributes/?recursive=True', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'Metadata-Flavor': 'Google', 'User-Agent': 'Cloud-Init/99.daily-202407012148-40e2eb4ae~ubuntu22.04.1'}} configuration
2024-07-02 17:43:46,767 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/attributes/?recursive=True (200, 671b) after 1 attempts
2024-07-02 17:43:46,767 - url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/project/attributes/?recursive=True' with {'url': 'http://metadata.google.internal/computeMetadata/v1/project/attributes/?recursive=True', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'Metadata-Flavor': 'Google', 'User-Agent': 'Cloud-Init/99.daily-202407012148-40e2eb4ae~ubuntu22.04.1'}} configuration
2024-07-02 17:43:46,769 - url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/project/attributes/?recursive=True (200, 1221b) after 1 attempts
2024-07-02 17:43:46,769 - util.py[DEBUG]: Crawl of GCE metadata service took 0.022 seconds
2024-07-02 17:43:46,769 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '0.0.0.0/0', 'via', '10.128.0.1', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,772 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '10.128.0.1/32', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,774 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'ens4', 'down'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,776 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'del', '10.128.0.113/32', 'dev', 'ens4'] with allowed return codes [0] (shell=False, capture=True)
2024-07-02 17:43:46,778 - DataSourceGCE.py[DEBUG]: Primary NIC found: ens4.
2024-07-02 17:43:46,779 - util.py[DEBUG]: Writing to /run/cloud-init/cloud-id-gce - wb: [644] 4 bytes

blackboxsw avatar Jul 02 '24 20:07 blackboxsw

I'd be curious what the dhcp lease contains cat '/run/dhclient.lease' which cloud-init received from Google's DHCP service during launch.

blackboxsw avatar Jul 02 '24 20:07 blackboxsw

Thanks @blackboxsw for your comments! You are right, there is DNS resolve issue in my environment. /etc/resolv.conf can't be generated automatic. Which causes http://metadata.google.internal/computeMetadata/v1 can't be resolved. This bug can be closed, thanks again!

Jonas-zhang avatar Jul 04 '24 09:07 Jonas-zhang

Thanks for the comment @Jonas-zhang!

blackboxsw avatar Jul 04 '24 12:07 blackboxsw