
Interfaces Appear as Down and Without Network Lease when using NetworkManager as Netplan Renderer Despite Functional Network

Open bryanfraschetti opened this issue 9 months ago • 5 comments

Bug report

When using NetworkManager as the network renderer, the networking summary tables (both the interface/IP/HW address table and the routing table) show incorrect information. There is no functional issue (the network comes up and works as expected), but the tables imply that the only active interface is the loopback interface, and they do not display the leases of other devices. Logging the output of the ip --json addr command shows that the interfaces really are DOWN at the point in time when the tables are populated. From the journal, NetworkManager appears to bring up the interfaces while cloud-init is in its later stages of booting, and it succeeds in obtaining leases only after cloud-init finishes executing. This is not the case when using networkd, where the interfaces are UP when the table is created.

I suspect the problem could be fixed through some mechanism that allows NetworkManager to bring up the interfaces earlier, but I also wonder whether the current design is intentional, given NetworkManager's reliance on D-Bus and how easily that can conflict with other processes.
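For anyone who wants to repeat the ip --json addr check described above, here is a minimal sketch of how the output could be condensed into one row per device. The function names are hypothetical and this is not cloud-init's own code; it assumes the usual iproute2 JSON fields (ifname, operstate, address, addr_info).

```python
import json
import subprocess


def summarize_interfaces(ip_json: str) -> list:
    """Condense `ip --json addr` output into one dict per device."""
    rows = []
    for iface in json.loads(ip_json):
        # Each addr_info entry normally carries "local" and "prefixlen".
        addresses = [
            f"{info['local']}/{info['prefixlen']}"
            for info in iface.get("addr_info", [])
            if "local" in info and "prefixlen" in info
        ]
        rows.append(
            {
                "device": iface.get("ifname"),
                "operstate": iface.get("operstate"),
                "hwaddr": iface.get("address"),
                "addresses": addresses,
            }
        )
    return rows


def read_live_state() -> list:
    """Hypothetical helper: run `ip --json addr` on the current machine."""
    result = subprocess.run(
        ["ip", "--json", "addr"], check=True, capture_output=True, text=True
    )
    return summarize_interfaces(result.stdout)
```

Calling read_live_state() early in boot and again after login should make the difference visible: with NetworkManager, the early call would be expected to show enp5s0 with an empty address list.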

Output of interfaces table with networkd:

ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | enp5s0 | True |        10.70.162.216        | 255.255.255.0 | global | 00:16:3e:12:f8:27 |
ci-info: | enp5s0 | True | fe80::216:3eff:fe12:f827/64 |       .       |  link  | 00:16:3e:12:f8:27 |
ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |           ::1/128           |       .       |  host  |         .         |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+

Output with NetworkManager

ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | enp5s0 | False |     .     |     .     |   .   | 00:16:3e:12:f8:27 |
ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+

Steps to reproduce the problem

# Create a VM (by default this uses networkd)
lxc launch --vm ubuntu:jammy renderer-nm
lxc shell renderer-nm

# For debugging purposes
cloud-init collect-logs -t cloud-init-networkd.tar.gz

# Enable the system to use NetworkManager
snap install network-manager
systemctl disable --now systemd-networkd

cat <<EOF > /etc/netplan/00-netplan.yaml
network:
    renderer: NetworkManager
    version: 2
    ethernets:
        enp5s0:
            dhcp4: true
EOF

netplan apply

# Reboot having switched to NetworkManager
reboot
lxc shell renderer-nm
cloud-init collect-logs -t cloud-init-network-manager.tar.gz

Environment details

Cloud-init version: 24.4.1-0ubuntu0~22.04.2
Operating System Distribution: Jammy
Cloud provider, platform or installer type: OpenStack, CloudStack, LXD, presumably others

cloud-init logs

Attached files outputted from cloud-init collect-logs.

cloud-init-networkd.tar.gz
cloud-init-network-manager.tar.gz

Note that when inspecting the cloud-init collect-logs output from the NetworkManager environment, it is important to go to the bottom of the files. For example, the top of cloud-init-output.log looks correct, but that information is left over from the previous networkd boot. The incomplete tables are at the bottom.

bryanfraschetti avatar Apr 11 '25 19:04 bryanfraschetti

Hi @bryanfraschetti, thanks for reporting.

Yes, NetworkManager is ordered after dbus and therefore configures network devices later in boot than networkd.

The concern is that the device information printed to the serial device does not include the configured addresses. Is that correct? This information is very helpful in debugging no-boot scenarios. It would be helpful to understand the negative impact on your use case when the instance is otherwise correctly configured. Can you please elaborate?

If cloud-init were to log this information again at a later stage when the devices are up, would that address this issue for you?
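As a rough illustration of what "logging again once the devices are up" could mean, the sketch below polls until some device reports a global-scope address and gives up after a timeout. Everything here is hypothetical (the function names, the timeout, the polling approach itself); it is not how cloud-init is implemented. The read_state callable is assumed to return data shaped like parsed ip --json addr output.

```python
import time


def has_global_address(iface: dict) -> bool:
    """True if any address on the device has scope 'global'."""
    return any(
        info.get("scope") == "global" for info in iface.get("addr_info", [])
    )


def wait_for_global_address(
    read_state,
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Poll read_state() until a device has a global address or time runs out."""
    deadline = clock() + timeout
    while True:
        if any(has_global_address(iface) for iface in read_state()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

If the function returns True, the table could be printed a second time; if it returns False, the first (empty) table already tells the debugging story.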

holmanb avatar Apr 11 '25 20:04 holmanb

Hi @holmanb,

You're exactly right that the concern is that the output to the serial device does not include the configured addresses. Essentially, we have a customer who is interpreting the table as informational/result-level output for the end user. FWIW, I also thought the table was meant to be viewed as a final result rather than as debugging information. Under this assumption, when the table does not include the IP addresses, it gives them the impression that the network is not functioning. At the same time, they do appreciate that the network is functional, but that means they perceive the disagreement between behaviour and logs as a bug.

As far as a specific negative impact goes, I think the discrepancy creates some confusion: "Is the network up or down? Why does cloud-init think the network is down?"

I suspect that if the table were logged again at a later stage, they would view that as resolving the problem. Though I wonder if the table being printed twice may end up creating confusion for another customer.

Thanks for the help on this so far. I can go back to the customer and let them know that the purpose of the table is for debugging certain types of boot failures and see if that's an acceptable explanation.

bryanfraschetti avatar Apr 11 '25 21:04 bryanfraschetti

Hi Bryan / Brett,

Thanks for chasing this down and for the detailed explanation.

I agree with Bryan's answer. I always saw that table as informational, and interpreted it as the final state of the instance. If I don't see the network in that table, my immediate thought is that cloud-init failed to configure the network, and I imagine other customers would interpret it the same way, so this is a bit confusing.

I would prefer that we fix the table output instead of explaining back to the customer that the table is meant for debugging certain types of boot failure.

If the purpose is solely for debugging these types of boot failure, I feel we should not display the table in the console by default. We could send it to debug logs, such as cloud-init.log, instead, and maybe have a cmdline option to opt in and display the table (alongside debugging information) on the console. That would be less misleading than displaying incorrect information by default (for certain use cases -- NetworkManager).

Displaying the table again at the end might be OK, and possibly acceptable to the customer, but, as Bryan said, I think the "table being printed twice may end up creating confusion for another customer". Maybe there's something we can display before each table to explain what we are showing there? Such as "Devices state before d-bus" and then "Devices state after d-bus".

Anyway, my opinion is that displaying a table without networking information when the network is working sounds like a bug and causes confusion to cloud-init users.

fabiomirmar avatar Apr 15 '25 12:04 fabiomirmar

> I always saw that table as informational, and interpreted it as the final state of the instance.

Out of curiosity, does any documentation suggest that?

> not display the table in the console by default, and send it to debug logs, such as cloud-init.log instead, and maybe have a cmdline option to opt-in and display the table (along side debugging information) to the console by default

I disagree. When network configuration fails, in many cases the instance is inaccessible to the user. Requiring login to extract this information would severely impede debugging cloud-init.

> table being printed twice may end up creating confusion for another customer

Maybe, but it's easily explainable.

> Anyway, my opinion is that displaying a table without networking information when the network is working sounds like a bug and causes confusion to cloud-init users.

Agreed. We could conditionally print only once before / after depending on whether NetworkManager is used. Feel free to submit a PR if you would like to see this happen.
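To make the "print once, before or after, depending on the renderer" idea concrete, here is a small sketch of the decision logic only. The stage names ("early", "late") and both functions are hypothetical placeholders, not existing cloud-init stages or APIs.

```python
def table_stage_for(renderer: str) -> str:
    """Pick the (hypothetical) stage at which the table is printed once."""
    # NetworkManager is ordered after dbus, so its devices come up later.
    if renderer.strip().lower() == "networkmanager":
        return "late"
    return "early"


def should_print_table(renderer: str, current_stage: str) -> bool:
    """True only in the single stage chosen for this renderer."""
    return current_stage == table_stage_for(renderer)
```

With this shape, each boot prints the table exactly once, which avoids the "printed twice" concern while still showing the configured addresses under NetworkManager.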

holmanb avatar Apr 15 '25 14:04 holmanb

> Out of curiosity, does any documentation suggest that?

Not that I'm aware of. It's more about perception, and apparently I'm not the only one.

> I disagree. When network configuration fails, in many cases the instance is inaccessible to the user. Requiring login to extract this information would severely impede debugging cloud-init.

That's why I suggested the opt-in cmdline option to display that information to the console.

To be clear here, I'm not suggesting that we move the table to cloud-init.log, but rather that we display the right networking information when we use NetworkManager. I'm just saying that IF the table is not informational, but rather has a debugging purpose, maybe its right place would be somewhere else.

fabiomirmar avatar Apr 15 '25 15:04 fabiomirmar