Ubuntu: cloud-init.service order After=NetworkManager.service not possible with Before=sysinit.target
This bug was originally filed in Launchpad as LP: #2015949
Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2023-04-12T03:50:22.316150+00:00
date_fix_committed = None
date_fix_released = None
id = 2015949
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/2015949
milestone = None
owner = chad.smith
owner_name = Chad Smith
private = False
status = triaged
submitter = chad.smith
submitter_name = Chad Smith
tags = []
duplicates = []
Launchpad user Chad Smith(chad.smith) wrote on 2023-04-12T03:50:22.316150+00:00
For Ubuntu Desktop images which prefer NetworkManager as the primary network configuration service, provide a mechanism by which cloud-init.service can be ordered After=NetworkManager.service and/or NetworkManager-wait-online.service.
Use case: The Ubuntu desktop live installer ISO prefers using NetworkManager as the primary network backend and cloud-init must order After=NetworkManager.service in these cases to avoid DNS-related bugs during datasource discovery and downloading user-data such as LP: #2008952.
Issue:
Upstream Ubuntu packaging of the systemd cloud-init.service file declares ordering as After=systemd-networkd-wait-online.service[1] and Before=sysinit.target[2]. Adding a new After=NetworkManager.service creates a systemd ordering cycle, which results in cloud-init.service being removed from the desired systemd boot targets. The ordering cycle arises because NetworkManager.service declares After=dbus.socket, which is incompatible with cloud-init.service declaring Before=sysinit.target.
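To make the cycle concrete, here is a minimal Python sketch that models the ordering relations described above as a directed graph and searches it for a loop. The edge list is an assumption assembled from this report plus systemd's documented default that socket units are ordered After=sysinit.target; it is not read from a live system, and the names `build_graph`, `find_cycle`, `BASE_ORDERING` and `NM_EDGE` are purely illustrative.

```python
def build_graph(orderings):
    """Build a 'starts before' graph from (earlier, later) unit pairs."""
    graph = {}
    for earlier, later in orderings:
        graph.setdefault(earlier, set()).add(later)
        graph.setdefault(later, set())
    return graph


def find_cycle(graph):
    """Return one ordering cycle as a list of units, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GREY:
                # Back edge: the cycle is the stack from nxt onwards.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed ordering edges, written as (earlier, later).
BASE_ORDERING = [
    ("cloud-init.service", "sysinit.target"),   # Before=sysinit.target
    ("sysinit.target", "dbus.socket"),          # socket default: After=sysinit.target
    ("dbus.socket", "NetworkManager.service"),  # After=dbus.socket
]
# The proposed addition: cloud-init.service After=NetworkManager.service.
NM_EDGE = ("NetworkManager.service", "cloud-init.service")

print(find_cycle(build_graph(BASE_ORDERING)))              # no cycle expected
print(find_cycle(build_graph(BASE_ORDERING + [NM_EDGE])))  # a cycle through all four units
```

Dropping the first edge (Before=sysinit.target) from the second graph removes the loop, which is the trade-off discussed in the rest of this report.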
Fix Proposal: A short-term fix has been released that provides an override for cloud-init.service in the livecd-rootfs project[3].
Mid-term need is to provide an environmental artifact or mechanism at systemd-generator timeframe to allow cloud-init.service to order After=NetworkManager.service and drop Before=sysinit.target for that use-case.
Since NetworkManager.service is After=sysinit.target due to After=dbus.service ordering, cloud-init.service would have to drop its Before=sysinit.target declarations in order to avoid systemd ordering cycles punting cloud-init out of the boot target.
Long-term want: Ideally, we may want to see NetworkManager.service support systemd ordering Before=sysinit.target, but that may involve NetworkManager growing the ability to plug in to dbus.service/socket/broker if dbus shows up later than NetworkManager.service. Upstream systemd-networkd made this shift to late-bind to the dbus broker, as discussed in LP: #1636912; the changes were eventually accepted for systemd-networkd.service[4][5].
But NetworkManager growing support for earlier boot before dbus.service is probably a longer term goal for NetworkManager than cloud-init.service allowing flexibility at systemd generator timeframe to prefer NetworkManager over networkd for certain images/environments.
[1] https://github.com/canonical/cloud-init/blob/main/systemd/cloud-init.service.tmpl#L11
[2] https://github.com/canonical/cloud-init/blob/main/systemd/cloud-init.service.tmpl#L33
[3] livecd-rootfs cloud-init.service overrides: https://code.launchpad.net/~chad.smith/livecd-rootfs/+git/livecd-rootfs/+merge/439586
[4] functional changes allowing networkd to set hostname at some point after networkd start when dbus service shows up: https://github.com/systemd/systemd/pull/4710
[5] networkd dropping After=dbus.service ordering: https://github.com/systemd/systemd/issues/4504
Launchpad user Brett Holman(holmanb) wrote on 2023-04-13T15:13:28.088306+00:00
Mid-term need is to provide an environmental artifact or mechanism at systemd-generator timeframe to allow cloud-init.service to order After=NetworkManager.service and drop Before=sysinit.target for that use-case.
How broad are we considering this use-case? Any image that uses NetworkManager? Only some specialized NoCloud images? Something else?
This change would cause cloud-init to no longer be blocking "as much of the remaining boot as possible"[1].
Dropping Before=sysinit.target from cloud-init.service could cause other services later in boot that are expecting cloud-init.service to be done by sysinit.target to fail. We could easily test base images, however I don't think this would be sufficient, since any package could provide a service that is ordered After=sysinit.target. Any service that currently orders after sysinit.target and expects cloud-init mounts/disk setup to be complete, for example, could be broken by the proposed change.
[1] https://cloudinit.readthedocs.io/en/latest/explanation/boot.html#network
I encountered this problem too with cloud-init version 25.1.2-0ubuntu0~24.04.1. The cloud-init network service was removed from the boot sequence due to an ordering cycle.
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found ordering cycle on cloud-init.service/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on systemd-networkd-wait-online.service/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on systemd-networkd.service/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on network-pre.target/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on dkms.service/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on basic.target/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on sockets.target/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on acpid.socket/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Jun 09 22:25:57 ip-172-31-69-169 systemd[1]: sysinit.target: Job cloud-init.service/start deleted to break ordering cycle starting with sysinit.target/start
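For anyone comparing cycles across machines, a small Python sketch like the following can pull the unit chain and the deleted job out of journal lines of this shape. The regular expressions are an assumption based only on the message format shown above, the function name `parse_cycle` is hypothetical, and the sample lines are copied from the log with the timestamp and host prefix trimmed.

```python
import re

# Patterns inferred from the journal messages quoted above (an assumption,
# not an official systemd log grammar).
CYCLE_RE = re.compile(
    r"Found (?:ordering cycle|dependency) on (?P<unit>[^\s/]+)/(?P<job>\w+)"
)
DELETED_RE = re.compile(
    r"Job (?P<unit>[^\s/]+)/(?P<job>\w+) deleted to break ordering cycle"
)


def parse_cycle(lines):
    """Return (units named in the cycle, in logged order; unit whose job was deleted)."""
    chain, deleted = [], None
    for line in lines:
        match = CYCLE_RE.search(line)
        if match:
            chain.append(match.group("unit"))
            continue
        match = DELETED_RE.search(line)
        if match:
            deleted = match.group("unit")
    return chain, deleted


SAMPLE = """\
sysinit.target: Found ordering cycle on cloud-init.service/start
sysinit.target: Found dependency on systemd-networkd-wait-online.service/start
sysinit.target: Found dependency on systemd-networkd.service/start
sysinit.target: Found dependency on network-pre.target/start
sysinit.target: Found dependency on dkms.service/start
sysinit.target: Found dependency on basic.target/start
sysinit.target: Found dependency on sockets.target/start
sysinit.target: Found dependency on acpid.socket/start
sysinit.target: Found dependency on sysinit.target/start
sysinit.target: Job cloud-init.service/start deleted to break ordering cycle starting with sysinit.target/start
""".splitlines()

chain, deleted = parse_cycle(SAMPLE)
print(" -> ".join(chain))
print("deleted:", deleted)
```

Because the patterns use `search`, the same function should also accept untrimmed journal lines.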
I attempted to remove the After=systemd-networkd-wait-online.service line from the service file (see https://github.com/canonical/cloud-init/pull/5772/files#diff-74dde4c2e47accf6df60f93bba5b7397523eb5050a8fa37e968531571aa66399R12-R14), but sometimes the EC2 instance failed to come up. When that line is removed from the service file, the change needs to be paired with the waits implemented in code in the same PR.
I can confirm that removing Before=sysinit.target results in a bootup sequence that passes consistently during my testing.
This problem occurs in Debian 13, after upgrading from Debian 12 and reinstalling cloud-init.
Seeing this with cloud-init 25.2 on Ubuntu 24.04.
I tried masking systemd-networkd-wait-online, but the result was cloud-init hanging forever on the Networking stage.
I tried replacing the networkd-wait with NetworkManager-wait in cloud-init.service, but it turns out that pulls sysinit.target back in, again causing cloud-init to not run. Removing any After= for a wait-online service and the Before=sysinit.target resolved the cycle, but--again--cloud-init just hangs in the Networking stage forever.
Honestly it would all be fine if networkd would acknowledge that NICs it doesn't manage can still be online.
@benvandenberg Your issue appears to be unrelated to NetworkManager. Based on the information that you provided I believe that your issue is actually related to an issue with dkms. See the discussion in the linked bug for possible fixes.
@MajorDallas Can you please add some more details? It is unclear to me if your system is using systemd or NetworkManager. Is this a server or desktop image? Please include journal logs of the dependency cycle.
I must apologize, I don't have the logs from that attempt--the VM has already been deleted.
Without going into unnecessary details, I was trying to troubleshoot some unexpected behavior with NICs retaining the MAC addresses from the base image and failing to get new DHCP leases as a result. In the course of that, I touched systemd-networkd configuration, NetworkManager configuration, Netplan, and a whole bunch of other stuff. Among the changes I made was an override for systemd-networkd-wait-online that would consistently fail for reasons I've been unable to determine. That failure was what was causing cloud-init to hang forever; as soon as I eliminated After=sysinit.target and reverted the wait-online configuration to defaults, it started to work consistently.
It is unclear to me if your system is using systemd or NetworkManager.
Honestly, me, too 😅 . In the course of trying to get this to work (create a "golden image" with CML2 installed but not configured), I've determined that when the clone starts up, networkd is responsible for everything, but by the time CML2 has finished its setup process everything has moved over to NetworkManager. NetworkManager is installed in the golden image and does have some configuration set, but during the first boot networkd remains the authority.
Is this a server or desktop image?
The base image is Ubuntu 24.04 Server, but the golden image has some additions--mostly CML2's dependencies (including NetworkManager and Firewalld).
Update:
I've been trying to track down exactly where the cycle starts, as neither NetworkManager nor its wait-online explicitly set a direct ordering relation with sysinit.target. Nevertheless, every systemd tool I know for analyzing this shows that NM-wait-online orders after sysinit. NM-wait-online's unit file does have Requires=sysinit.target, but no order relation. systemctl list-dependencies --after NetworkManager.service reflects a direct dependency on sysinit.target and a transitive dependency to it through basic.target (which is also not declared in NM's unit files). I'm a bit at a loss in trying to make it so putting renderer: NetworkManager in the meta-data netplan "just works," but as it is that only causes cloud-init to hang forever after systemd-networkd-wait-online fails.
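One possible explanation for the undeclared ordering, offered as an assumption rather than a confirmed diagnosis: systemd.service(5) documents that service units with DefaultDependencies=yes (the default) implicitly gain Requires= and After= on sysinit.target, After= on basic.target, and Conflicts= and Before= on shutdown.target. The Python sketch below models that rule. The explicit settings passed in for the wait-online unit are illustrative, not copied from the installed unit file, and `effective_dependencies` is a hypothetical helper.

```python
# Implicit dependencies documented for service units with DefaultDependencies=yes.
SERVICE_DEFAULTS = {
    "Requires": ["sysinit.target"],
    "After": ["sysinit.target", "basic.target"],
    "Conflicts": ["shutdown.target"],
    "Before": ["shutdown.target"],
}


def effective_dependencies(explicit, default_dependencies=True):
    """Merge a unit's explicit settings with the implicit service defaults."""
    merged = {key: list(values) for key, values in explicit.items()}
    if default_dependencies:
        for key, units in SERVICE_DEFAULTS.items():
            bucket = merged.setdefault(key, [])
            for unit in units:
                if unit not in bucket:
                    bucket.append(unit)
    return merged


# Illustrative explicit settings for a wait-online style unit (assumption).
nm_wait_online = {
    "After": ["NetworkManager.service"],
    "Before": ["network-online.target"],
}

print(effective_dependencies(nm_wait_online)["After"])
print(effective_dependencies(nm_wait_online, default_dependencies=False)["After"])
```

If that rule is what applies here, a unit would only avoid the sysinit.target ordering by setting DefaultDependencies=no, which is a decision for the unit's own packaging.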
One more:
The server does not have dkms installed at all, although the symptoms in that linked issue are fairly similar.
I made another attempt at replacing networkd-wait with nm-wait, using the following override:
# /etc/systemd/system/cloud-init.service.d/override.conf
[Unit]
After=
After=NetworkManager-wait-online.service
After=NetworkManager.service
After=cloud-init-local.service
Before=
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
Before=shutdown.target
The result was that networkd-wait would eventually fail, then systemd would remove cloud-init.service to repair the cycle on sysinit.target through NetworkManager.service:
I think sysinit.target is getting pulled in via BindsTo=dbus.service in NetworkManager.service. dbus is after both sysinit.target and basic.target. That seems like a pretty big problem, and sadly not one that cloud-init can do anything about.
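A further factor that may be worth checking, again as an assumption rather than a verified cause: systemd.unit(5) states that dependency settings such as After= and Before= cannot be reset to an empty list from a drop-in, so a drop-in can only add dependencies and removing one requires overriding the whole unit. The Python sketch below models that merge rule against the override shown above. The base unit text is a hypothetical, heavily abridged stand-in containing only the two orderings named in this report, and both function names are illustrative.

```python
def parse_ordering(text):
    """Collect After=/Before= values from unit-file text, in file order."""
    result = {"After": [], "Before": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in result:
            result[key].append(value.strip())
    return result


def merge_dependencies(base_text, dropin_text):
    """Merge ordering settings under the 'drop-ins can only add' rule."""
    base = parse_ordering(base_text)
    dropin = parse_ordering(dropin_text)
    merged = {}
    for key in ("After", "Before"):
        units = []
        for value in base[key] + dropin[key]:
            # An empty assignment contributes nothing and clears nothing.
            for unit in value.split():
                if unit not in units:
                    units.append(unit)
        merged[key] = units
    return merged


# Hypothetical, abridged base unit (assumption).
BASE_UNIT = """\
[Unit]
After=systemd-networkd-wait-online.service
Before=sysinit.target
"""

# The override from this thread.
DROPIN = """\
# /etc/systemd/system/cloud-init.service.d/override.conf
[Unit]
After=
After=NetworkManager-wait-online.service
After=NetworkManager.service
After=cloud-init-local.service
Before=
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
Before=shutdown.target
"""

merged = merge_dependencies(BASE_UNIT, DROPIN)
print("After: ", merged["After"])
print("Before:", merged["Before"])
```

Under that rule both Before=sysinit.target and After=systemd-networkd-wait-online.service would still be in effect, which would be consistent with networkd-wait still being waited on and with the cycle on sysinit.target. Running systemctl show cloud-init.service on an affected machine would be the way to confirm what systemd actually loaded.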