cloud-init icon indicating copy to clipboard operation
cloud-init copied to clipboard

integration test timeouts ec2: ephemeral network config on Ec2 when calling `cloud-init init --local ` on the commandline after cloud-init clean

Open blackboxsw opened this issue 1 year ago • 1 comments

Bug report

Recent commit f74b61eae24706dd313efc06792198a718f06672 provided support for EphemeralIPNetwork context manager for dhcpcd to bring down the link and del the associated address to avoid errors with network interface renames per #5113.

This introduced a side effect that this link teardown would be performed on anytime cloud-init init --local is run on the command line if the following conditions are met:

  • running in an image with dhcpcd installed as the preferred dhcp client instead of isc-dhclient. (Ubuntu Noble)
  • cloud-init clean is run to allow cloud-init to treat the instance as new
  • then cloud-init init --local on the CLI from the running instance on a cloud platform which relies oh EphemeralIPNetwork context manager to setup networking in init-local timeframe in order to contact the IMDS (Ec2 and GCP)

In this case, a CLI call to cloud-init clean; cloud-init init --local will attempt to rerun dhcpcd client to re-obtain IP information regardless of the fact that the interface was already configured and up. Once the DHCP lease is obtained from dhcpcd it'll bring the link down and delete the address associated with the network interface. This leaves the primary network down and causes integration tests to timeout resulting in integration test job failures.

Steps to reproduce the problem

Launch and ec2 Ubuntu Noble instance ssh into the VM and run cloud-init clean; cloud-init init --local; Expect connection to be dropped.

Or run the integration tests below against Ec2 or GCE

CLOUD_INIT_PLATFORM=ec2 tox -e -integration-tests -- tests/integration_tests/datasources/test_ec2_ipv6.py::test_dual_stack

- or -
  
CLOUD_INIT_PLATFORM=gce  tox -e -integration-tests -- tests/integration_tests/reporting/test_webhook_reporting.py

Environment details

  • Cloud-init version: 24.1.3
  • Operating System Distribution: Ubuntu
  • Cloud provider, platform or installer type: ec2

test

timeouts on jenkins jobs:

blackboxsw avatar Apr 12 '24 00:04 blackboxsw

Thanks for filing this issue @blackboxsw.

I am working on addressing this behavior change for our integration tests. Some users using cloud-init in unsupported ways may benefit from this work, but I don't want this effort to be misunderstood as endorsement of using cloud-init in this way.

We should be clear that this isn't an intended way of using cloud-init. Users should not be running cloud-init commands manually like this. Anyone running cloud-init manually in this way should understand that they are relying on undefined, untested, and unsupported behavior.

It happens to be useful to be able to do this in integration tests. As far as I'm concerned, the integration test fix is the only reason to address this change in behavior. Not changing behavior unnecessarily on users is a reasonable ideal. However, there are many side-effects from cloud-init commands that we simply cannot guarantee in a manual non-boot context. A user that runs this command manually and sees this behavior hasn't destroyed their system. All they've done is set a link down; recovery is simple (reboot/console/etc).

holmanb avatar Apr 12 '24 19:04 holmanb

This has been fixed, closing.

holmanb avatar May 29 '24 16:05 holmanb