Netplan/Systemd/Cloud-init/Dbus Race
This bug was originally filed in Launchpad as LP: #1997124
Launchpad details
affected_projects = ['netplan', 'systemd (Ubuntu)']
assignee = falcojr
assignee_name = James Falcon
date_closed = None
date_created = 2022-11-18T23:07:12.002484+00:00
date_fix_committed = None
date_fix_released = None
id = 1997124
importance = high
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1997124
milestone = None
owner = holmanb
owner_name = Brett Holman
private = False
status = in_progress
submitter = holmanb
submitter_name = Brett Holman
tags = []
duplicates = []
Launchpad user Brett Holman(holmanb) wrote on 2022-11-18T23:07:12.002484+00:00
Cloud-init is seeing intermittent failures while running netplan apply, which appear to be caused by a resource that is missing at the time of the call.
The symptom in cloud-init logs looks like:
Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory
I think that this error[1] is likely caused by cloud-init running netplan apply too early in the boot process (before dbus is active).
Today I stumbled upon this error which was hit in MAAS[2]. We have also hit it intermittently during tests (we didn't have a reproducer).
Realizing that this may not be a cloud-init error, but possibly a dependency bug between dbus and systemd, we decided to file this bug for broader visibility to other projects.
I will follow up this initial report with some comments from our discussion earlier.
[1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
[2] https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970
Launchpad user Brett Holman(holmanb) wrote on 2022-11-18T23:23:58.106907+00:00
Some details from a conversation with Chad, James, and Vorlon.
netplan apply is executed in cloud-init.service, which runs
Before=network-online.target but After=systemd-networkd-wait-online.service.
There may be a dependency bug between dbus and systemd-networkd: systemd-networkd is a dbus service, so when it's "up" it should be accessible over dbus.
Should cloud-init or systemd-networkd-wait-online.service require being ordered after dbus.service?
Launchpad user Dimitri John Ledkov(xnox) wrote on 2022-11-19T00:41:26.055748+00:00
They should be ordered after dbus.socket, which should be enough to activate those services. However, they themselves will enqueue themselves into the systemd startup sequence.
Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.
Launchpad user Dimitri John Ledkov(xnox) wrote on 2022-11-19T00:42:31.554536+00:00
Note this bug was opened against upstream projects. Systemd does not use launchpad for bug tracking, did you mean to mark Ubuntu(Systemd) as affected?
Launchpad user Brett Holman(holmanb) wrote on 2022-11-19T04:59:15.939528+00:00
Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.
+1
did you mean to mark Ubuntu(Systemd) as affected?
Yes, I'll update that thanks.
Launchpad user Chad Smith(chad.smith) wrote on 2022-12-05T15:14:30.032351+00:00
Confirmed that we need dbus.socket in cloud-init.service as the dependency chain doesn't explicitly define that ordering dependency. We'll need this for netplan apply to work without a race
Launchpad user Launchpad Janitor(janitor) wrote on 2022-12-08T10:02:31.858732+00:00
Status changed to 'Confirmed' because the bug affects multiple users.
Launchpad user Rod Smith(rodsmith) wrote on 2023-01-05T01:40:10.844534+00:00
We've run into this in the Server Certification lab on mouser, a NEC Express5800/R128h-1M server. In our testing, it's affected 50% (2 of 4) Ubuntu 20.04 deployments, but not 18.04 or 22.04 deployments (0 of 3 for each of those). These sample sizes are low, so this may be a coincidence; but I'm mentioning it here in case it's not a coincidence.
Launchpad user Chad Smith(chad.smith) wrote on 2023-01-20T06:25:43.348024+00:00
The dbus race that is happening here is due to networkctl reconfigure[1] being run by netplan apply, failing to talk to dbus, and restarting systemd-networkd[2] at that point in time when systemd-networkd may actually be coming up and is in an indeterminate state.
[1] https://github.com/canonical/netplan/blob/main/netplan/cli/utils.py#L116
[2] https://github.com/canonical/netplan/blob/main/netplan/cli/commands/apply.py#L277
I'm guessing the restart here from netplan apply is what's triggering the occasional failure case where not all network config is applied (like IP addresses) in systemd-networkd. It doesn't happen all the time but it's racy as systemd-networkd is mid startup and we're restarting it again via netplan apply.
After discussion with waldi (Bastian Blank) in Debian land about the systemd dependency chain, it seems my suggestion about adding dbus.socket to cloud-init.service will actually introduce an ordering cycle, because dbus.socket is After=sysinit.target, yet cloud-init.service is Before=sysinit.target.
So, trying to shoehorn cloud-init into the dependency chain After=dbus.socket is impossible for systemd to schedule.
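The cycle can be made concrete with a small sketch. It models only the two ordering facts quoted above plus the proposed After=dbus.socket edge; it is not how systemd itself computes its transaction, and the names `find_cycle`, `existing` and `proposed` are illustrative.

```python
from collections import defaultdict


def find_cycle(after_edges):
    """Return one ordering cycle as a list of units, or None if there is none.

    after_edges maps a unit to the list of units it must start after.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = defaultdict(int)
    stack = []

    def visit(unit):
        state[unit] = GREY
        stack.append(unit)
        for dep in after_edges.get(unit, ()):
            if state[dep] == GREY:
                # dep is already on the current path, so the path loops back.
                return stack[stack.index(dep):] + [dep]
            if state[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[unit] = BLACK
        return None

    for unit in list(after_edges):
        if state[unit] == WHITE:
            cycle = visit(unit)
            if cycle:
                return cycle
    return None


# Ordering facts as stated in this thread, written as "unit: starts after ...".
existing = {
    "dbus.socket": ["sysinit.target"],         # dbus.socket is After=sysinit.target
    "sysinit.target": ["cloud-init.service"],  # cloud-init.service is Before=sysinit.target
}

# The proposed change: cloud-init.service After=dbus.socket.
proposed = dict(existing)
proposed["cloud-init.service"] = ["dbus.socket"]

print(find_cycle(existing))
print(find_cycle(proposed))
```

With only the existing edges there is nothing to report; adding the proposed edge closes the loop dbus.socket, sysinit.target, cloud-init.service and back to dbus.socket.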
Maybe, we'd want one of the following instead:
- netplan apply provide an option to avoid falling back to networkctl reconfigure and exit non-zero so cloud-init can do something better, or retry where necessary
- netplan apply can defer or block/retry until dbus.socket/service is ready, allowing this only to affect cases where netplan apply is called
- cloud-init to defer calling netplan apply on systemd-networkd environments until later boot stage (cloud-config.service) which comes after sysinit.target (and therefore can expect dbus.socket to be started at that point in boot)
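As a rough illustration of the "defer or block/retry" option, the sketch below polls for the system bus socket before running netplan apply. It is not code from netplan or cloud-init: the helper names `wait_for_path` and `apply_when_bus_ready` are hypothetical, and the socket path `/run/dbus/system_bus_socket` is an assumption about where the system bus socket lives on the affected images. The presence of the socket file also does not prove that systemd-networkd will answer on the bus.

```python
import os
import subprocess
import time

# Assumption: conventional location of the D-Bus system bus socket.
SYSTEM_BUS_SOCKET = "/run/dbus/system_bus_socket"


def wait_for_path(path, timeout=30.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until path exists; return True if it appeared, False on timeout."""
    deadline = clock() + timeout
    while True:
        if os.path.exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def apply_when_bus_ready(run=subprocess.run, socket_path=SYSTEM_BUS_SOCKET,
                         timeout=30.0):
    """Run `netplan apply` only once the system bus socket is present."""
    if not wait_for_path(socket_path, timeout=timeout):
        raise TimeoutError(f"{socket_path} did not appear within {timeout}s")
    # check=True surfaces a non-zero exit so the caller can retry or report it.
    return run(["netplan", "apply"], check=True)
```

The `run`, `sleep` and `clock` parameters exist only so the logic can be exercised without touching the real system; a caller would normally leave them at their defaults.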
I'll add netplan here to see if there are thoughts or counter suggestions here.
Launchpad user Lukas Märdian(slyon) wrote on 2023-01-25T15:45:38.879245+00:00
I think the "Failed to connect system bus: No such file or directory" stderr output rather comes from networkctl [1] than from "netplan-dbus" (Netplan's output would be "... connect TO system bus..."). netplan-dbus is not involved at all AFAICS, as cloud-init is calling into the "netplan apply" CLI and not calling its "io.netplan.Netplan Apply()" DBus method; which would fail due to missing DBus communication, too.
So the root-cause IMO is networkctl trying to talk to systemd-networkd via DBus, which is not yet ready. Porting this communication to using varlink instead of dbus could solve this (but is probably a big task). Are we sure that systemd-networkd.service is already up-and-running at this stage and dbus.service/.socket being the bottleneck? We're sorting After=systemd-networkd-wait-online.service, so I assume: Yes.
Netplan's "apply" CLI could probably implement a "systemctl is-active ..." check for dbus.service/.socket and/or systemd-networkd.service/NetworkManager.service (depending on which backend is about to be (re-)configured). But generally "netplan apply" is designed to be a userspace tool and only Netplan's generator is designed to be executed during early boot. So if it's possible to postpone the execution of "netplan apply" until after systemd's initial boot transaction finished (i.e. into cloud-config.service) this would IMO be the cleaner solution and could avoid similar, future issues related to early boot.
[1] https://github.com/systemd/systemd/blob/main/src/network/networkctl.c#L2992
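The "systemctl is-active ..." check suggested above could look roughly like the following. This is a minimal sketch, not Netplan's implementation: `unit_is_active` and `missing_prerequisites` are hypothetical helper names, and which units to check for a given backend is left to the caller.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True when `systemctl is-active --quiet <unit>` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def missing_prerequisites(units, is_active=unit_is_active):
    """Return the units from `units` that are not active yet, in order."""
    return [unit for unit in units if not is_active(unit)]


if __name__ == "__main__":
    # Example for the systemd-networkd backend, using unit names from this thread.
    pending = missing_prerequisites(["dbus.socket", "systemd-networkd.service"])
    if pending:
        print("not ready:", ", ".join(pending))
```

A caller could refuse to proceed, or retry later, while the returned list is non-empty.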
I am seeing a similar issue. If there are troubleshooting steps or logs that I can provide, I would be happy to do so with some guidance. The error: activators.py: Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system to bus: no such file or directory seemingly leads to a timeout for cloud-init on my systems.
Thanks for contributing to cloud-init. To collect the logs, run sudo cloud-init collect-logs and attach the tarball to this issue, making sure sensitive information is redacted.
I got this error consistently trying to build an image using packer qemu builder. Attaching the cloud init logs gathered with cloud-init collect-logs