integration test timeouts ec2: ephemeral network config on Ec2 when calling `cloud-init init --local ` on the commandline after cloud-init clean
Bug report
Recent commit f74b61eae24706dd313efc06792198a718f06672 provided support for EphemeralIPNetwork context manager for dhcpcd to bring down the link and del the associated address to avoid errors with network interface renames per #5113.
This introduced a side effect that this link teardown would be performed on anytime cloud-init init --local is run on the command line if the following conditions are met:
- running in an image with dhcpcd installed as the preferred dhcp client instead of isc-dhclient. (Ubuntu Noble)
cloud-init cleanis run to allow cloud-init to treat the instance asnew- then
cloud-init init --localon the CLI from the running instance on a cloud platform which relies oh EphemeralIPNetwork context manager to setup networking in init-local timeframe in order to contact the IMDS (Ec2 and GCP)
In this case, a CLI call to cloud-init clean; cloud-init init --local will attempt to rerun dhcpcd client to re-obtain IP information regardless of the fact that the interface was already configured and up. Once the DHCP lease is obtained from dhcpcd it'll bring the link down and delete the address associated with the network interface. This leaves the primary network down and causes integration tests to timeout resulting in integration test job failures.
Steps to reproduce the problem
Launch and ec2 Ubuntu Noble instance ssh into the VM and run cloud-init clean; cloud-init init --local; Expect connection to be dropped.
Or run the integration tests below against Ec2 or GCE
CLOUD_INIT_PLATFORM=ec2 tox -e -integration-tests -- tests/integration_tests/datasources/test_ec2_ipv6.py::test_dual_stack
- or -
CLOUD_INIT_PLATFORM=gce tox -e -integration-tests -- tests/integration_tests/reporting/test_webhook_reporting.py
Environment details
- Cloud-init version: 24.1.3
- Operating System Distribution: Ubuntu
- Cloud provider, platform or installer type: ec2
test
timeouts on jenkins jobs:
Thanks for filing this issue @blackboxsw.
I am working on addressing this behavior change for our integration tests. Some users using cloud-init in unsupported ways may benefit from this work, but I don't want this effort to be misunderstood as endorsement of using cloud-init in this way.
We should be clear that this isn't an intended way of using cloud-init. Users should not be running cloud-init commands manually like this. Anyone running cloud-init manually in this way should understand that they are relying on undefined, untested, and unsupported behavior.
It happens to be useful to be able to do this in integration tests. As far as I'm concerned, the integration test fix is the only reason to address this change in behavior. Not changing behavior unnecessarily on users is a reasonable ideal. However, there are many side-effects from cloud-init commands that we simply cannot guarantee in a manual non-boot context. A user that runs this command manually and sees this behavior hasn't destroyed their system. All they've done is set a link down; recovery is simple (reboot/console/etc).
This has been fixed, closing.