
Netplan/Systemd/Cloud-init/Dbus Race

Open ubuntu-server-builder opened this issue 2 years ago • 12 comments

This bug was originally filed in Launchpad as LP: #1997124

Launchpad details
affected_projects = ['netplan', 'systemd (Ubuntu)']
assignee = falcojr
assignee_name = James Falcon
date_closed = None
date_created = 2022-11-18T23:07:12.002484+00:00
date_fix_committed = None
date_fix_released = None
id = 1997124
importance = high
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1997124
milestone = None
owner = holmanb
owner_name = Brett Holman
private = False
status = in_progress
submitter = holmanb
submitter_name = Brett Holman
tags = []
duplicates = []

Launchpad user Brett Holman(holmanb) wrote on 2022-11-18T23:07:12.002484+00:00

Cloud-init is seeing intermittent failures while running netplan apply, which appears to be caused by a missing resource at the time of call.

The symptom in cloud-init logs looks like:

Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory

I think that this error[1] is likely caused by cloud-init running netplan apply too early in the boot process (before dbus is active).

Today I stumbled upon this error which was hit in MAAS[2]. We have also hit it intermittently during tests (we didn't have a reproducer).

Realizing that this may not be a cloud-init error, but possibly a dependency bug between dbus and systemd, we decided to file this bug for broader visibility to other projects.

I will follow up this initial report with some comments from our discussion earlier.

[1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
[2] https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970
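One way to test the "too early" hypothesis is to check for the system bus socket immediately before netplan apply is invoked. A minimal sketch, assuming the conventional socket path /run/dbus/system_bus_socket; the helper name and the default path are illustrative and not part of cloud-init:

```python
import os
import stat

# Assumed conventional location of the D-Bus system bus socket.
SYSTEM_BUS_SOCKET = "/run/dbus/system_bus_socket"


def system_bus_socket_present(path: str = SYSTEM_BUS_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Includes the "No such file or directory" (ENOENT) case.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    print(system_bus_socket_present())
```

Logging this result alongside the netplan apply call would show whether the failures line up with the socket being absent.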

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Brett Holman(holmanb) wrote on 2022-11-18T23:23:58.106907+00:00

Some details from a conversation with Chad, James, and Vorlon.

netplan apply is executed in cloud-init.service, which runs Before=network-online.target but After=systemd-networkd-wait-online.service.

There may be a dependency bug between dbus and systemd-networkd: systemd-networkd is a dbus service, so when it's "up" it should be accessible over dbus.

Should cloud-init or systemd-networkd-wait-online.service require being ordered after dbus.service?

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Dimitri John Ledkov(xnox) wrote on 2022-11-19T00:41:26.055748+00:00

They should be ordered after dbus.socket, which should be enough to activate those services. However, they will enqueue themselves into the systemd startup sequence.

Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Dimitri John Ledkov(xnox) wrote on 2022-11-19T00:42:31.554536+00:00

Note this bug was opened against upstream projects. Systemd does not use launchpad for bug tracking, did you mean to mark Ubuntu(Systemd) as affected?

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Brett Holman(holmanb) wrote on 2022-11-19T04:59:15.939528+00:00

> Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.

+1

> did you mean to mark Ubuntu(Systemd) as affected?

Yes, I'll update that thanks.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2022-12-05T15:14:30.032351+00:00

Confirmed that we need dbus.socket in cloud-init.service, as the dependency chain doesn't explicitly define that ordering dependency. We'll need this for netplan apply to work without a race.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Launchpad Janitor(janitor) wrote on 2022-12-08T10:02:31.858732+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Rod Smith(rodsmith) wrote on 2023-01-05T01:40:10.844534+00:00

We've run into this in the Server Certification lab on mouser, a NEC Express5800/R128h-1M server. In our testing, it's affected 50% (2 of 4) of Ubuntu 20.04 deployments, but no 18.04 or 22.04 deployments (0 of 3 for each of those). These sample sizes are small, so this may be a coincidence, but I'm mentioning it here in case it's not.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Chad Smith(chad.smith) wrote on 2023-01-20T06:25:43.348024+00:00

The dbus race here is due to networkctl reconfigure[1], run by netplan apply, failing to talk to dbus; netplan apply then restarts systemd-networkd[2] at a point when systemd-networkd may itself still be coming up and is in an indeterminate state.

[1] https://github.com/canonical/netplan/blob/main/netplan/cli/utils.py#L116
[2] https://github.com/canonical/netplan/blob/main/netplan/cli/commands/apply.py#L277

I'm guessing the restart here from netplan apply is what's triggering the occasional failure case where not all network config (like IP addresses) is applied in systemd-networkd. It doesn't happen every time, but it's racy: systemd-networkd is mid-startup and we're restarting it again via netplan apply.

After discussion with waldi (Bastian Blank) in Debian land about the systemd dependency chain, it seems my suggestion about adding dbus.socket to cloud-init.service will actually introduce an ordering cycle, because dbus.socket is After=sysinit.target, yet cloud-init.service is Before=sysinit.target.

So, trying to shoehorn cloud-init into the dependency chain After=dbus.socket is impossible for systemd to schedule.
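The cycle can be reproduced with a tiny ordering graph. A sketch that uses only the ordering facts stated in this comment; the function and the graph encoding are illustrative, and systemd's real job scheduling is more involved:

```python
from typing import Dict, List, Optional


def find_cycle(after: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one ordering cycle as a list of units, or None.

    `after[u]` lists the units that `u` must start after.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}
    stack: List[str] = []

    def visit(unit: str) -> Optional[List[str]]:
        color[unit] = GREY
        stack.append(unit)
        for dep in after.get(unit, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # Back edge: dep is already on the current path.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[unit] = BLACK
        return None

    for unit in list(after):
        if color.get(unit, WHITE) == WHITE:
            found = visit(unit)
            if found:
                return found
    return None


# Ordering as reported in this thread.
current = {
    # cloud-init.service is Before=sysinit.target
    "sysinit.target": ["cloud-init.service"],
    # dbus.socket is After=sysinit.target
    "dbus.socket": ["sysinit.target"],
}

# The proposed change: cloud-init.service After=dbus.socket.
proposed = {**current, "cloud-init.service": ["dbus.socket"]}

print(find_cycle(current))
print(find_cycle(proposed))
```

The first call finds no cycle; the second returns a path through all three units that ends where it started, which is the ordering systemd cannot schedule.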

Maybe, we'd want one of the following instead:

  1. netplan apply provides an option to avoid falling back to networkctl reconfigure and instead exits non-zero, so cloud-init can do something better or retry where necessary.
  2. netplan apply defers or blocks/retries until dbus.socket/service is ready, so that this only affects cases where netplan apply is called.
  3. cloud-init defers calling netplan apply on systemd-networkd environments until a later boot stage (cloud-config.service), which comes after sysinit.target (and can therefore expect dbus.socket to be started at that point in boot).
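The block/retry idea in option 2 could look roughly like this when written from the caller's side. This is a sketch only: the function names, the timings, and the choice to match on the stderr text are all assumptions made for illustration, not existing netplan or cloud-init behaviour.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Stderr fragment reported in this bug.
BUS_ERROR = "Failed to connect system bus"

Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run `cmd` and return (exit code, stderr text)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stderr


def apply_with_retry(
    cmd: Sequence[str] = ("netplan", "apply"),
    attempts: int = 5,
    delay: float = 1.0,
    runner: Runner = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Re-run `cmd` while its stderr reports the bus connection error."""
    code, err = runner(cmd)
    for _ in range(attempts - 1):
        if BUS_ERROR not in err:
            break
        sleep(delay)
        code, err = runner(cmd)
    return code, err
```

The runner and sleep parameters are injectable so the loop can be exercised without a real netplan binary. A retry of this kind would not undo a systemd-networkd restart that an earlier attempt already triggered, which is part of why option 3 may be cleaner.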

I'll add netplan here to see if there are thoughts or counter suggestions here.

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

Launchpad user Lukas Märdian(slyon) wrote on 2023-01-25T15:45:38.879245+00:00

I think the "Failed to connect system bus: No such file or directory" stderr output rather comes from networkctl [1] than from "netplan-dbus" (Netplan's output would be "... connect TO system bus..."). netplan-dbus is not involved at all AFAICS, as cloud-init is calling into the "netplan apply" CLI and not calling its "io.netplan.Netplan Apply()" DBus method; which would fail due to missing DBus communication, too.

So the root-cause IMO is networkctl trying to talk to systemd-networkd via DBus, which is not yet ready. Porting this communication to using varlink instead of dbus could solve this (but is probably a big task). Are we sure that systemd-networkd.service is already up-and-running at this stage and dbus.service/.socket being the bottleneck? We're sorting After=systemd-networkd-wait-online.service, so I assume: Yes.

Netplan's "apply" CLI could probably implement a "systemctl is-active ..." check for dbus.service/.socket and/or systemd-networkd.service/NetworkManager.service (depending on which backend is about to be (re-)configured). But generally "netplan apply" is designed to be a userspace tool, and only Netplan's generator is designed to be executed during early boot. So if it's possible to postpone the execution of "netplan apply" until after systemd's initial boot transaction has finished (i.e. into cloud-config.service), this would IMO be the cleaner solution and could avoid similar future issues related to early boot.

[1] https://github.com/systemd/systemd/blob/main/src/network/networkctl.c#L2992
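A "systemctl is-active" pre-check of the kind suggested above might be sketched as follows. The helper names and the unit list in the usage note are illustrative assumptions, not existing Netplan code:

```python
import subprocess
from typing import Callable, Iterable, List


def is_active(unit: str) -> bool:
    """Return True if `systemctl is-active` reports `unit` as active."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        # systemctl is not installed; treat the unit as not active.
        return False
    return result.returncode == 0


def inactive_units(
    units: Iterable[str], check: Callable[[str], bool] = is_active
) -> List[str]:
    """Return the units from `units` that are not currently active."""
    return [unit for unit in units if not check(unit)]
```

For the systemd-networkd backend, a caller could run inactive_units(["dbus.socket", "systemd-networkd.service"]) and skip or defer the reconfigure step if the returned list is not empty.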

ubuntu-server-builder avatar May 12 '23 20:05 ubuntu-server-builder

I am seeing a similar issue. If there are troubleshooting steps or logs that I can provide, I would be happy to do so with some guidance. The error

activators.py: Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system to bus: no such file or directory

seemingly leads to a timeout for cloud-init on my systems.

gary-sixgen avatar Oct 03 '23 14:10 gary-sixgen

Thanks for contributing to cloud-init. To collect the logs, run:

sudo cloud-init collect-logs

Then attach the tarball to this issue, making sure sensitive information is redacted.

aciba90 avatar Oct 05 '23 09:10 aciba90

I got this error consistently while trying to build an image using the Packer QEMU builder. Attaching the cloud-init logs gathered with cloud-init collect-logs.

cloud-init.tar.gz

karlbohlmark avatar Feb 11 '25 17:02 karlbohlmark