systemd-networkd-wait-online sometimes fails
Issue Report
Bug
Container Linux Version
Current master, probably all versions
Environment
qemu and azure, most likely others as well
Expected Behavior
systemd-networkd-wait-online.service starts successfully
Actual Behavior
Infrequently, it fails to start.
Reproduction Steps
Repeatedly run any kola test that has something that Requires or Wants the unit until it fails (the docker tests are good examples of things to run).
Other Information
Looking at logs from the failed and passed tests, and from debugging live on an azure VM with @arithx, it appears that on failed tests networkd doesn't log Gained IPv6LL, whereas it does on tests that pass. This is probably a networkd bug, not a systemd-networkd-wait-online.service bug. networkctl never shows the link as configured when the tests fail.
I've been digging, gunna dump what I've found so far:
- Toggling the link down and up (e.g. with ip link set eth0 down && ip link set eth0 up) fixes the issue.
- Links on machines that fail never get IPv6 addresses (unless the link is toggled).
- This branch is taken on machines that work but not on machines that do not.
  - Adding some debug printing shows that machines that fail have the IP address :: (the unspecified address), whereas machines that do not fail already have an fe80... (link-local) address assigned.
  - It looks like the kernel should be assigning this when the link comes up but isn't for some reason.
- Masking systemd-networkd.service and running ip link set eth0 up brings the link up and assigns the IPv6LL address. It's unclear to me at this moment whether the kernel does this automatically when the link comes up or whether ip issues extra commands to do so. My hunch is the former.
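The link-local check and the link toggle described above can be sketched as a small shell helper. This is only an illustration, not what networkd does internally: the function names has_ipv6ll and fix_link are made up for this example, and eth0 is assumed as the default interface, as in the commands above.

```shell
#!/bin/sh
# Sketch only: has_ipv6ll and fix_link are hypothetical helper names.

# Read the output of `ip -6 -o addr show dev <iface>` on stdin and succeed
# if it lists an IPv6 link-local address (fe80::/10, i.e. fe80 through febf).
has_ipv6ll() {
    grep -Eiq 'inet6 fe[89ab][0-9a-f]:'
}

# Toggle the link only when no link-local address is present.
# Assumes root privileges and defaults to eth0 when no interface is given.
fix_link() {
    iface="${1:-eth0}"
    if ip -6 -o addr show dev "$iface" | has_ipv6ll; then
        echo "$iface already has an IPv6LL address"
    else
        # Chain down and up so the link is not left down over ssh.
        ip link set "$iface" down && ip link set "$iface" up
    fi
}
```

Sourcing the file and running fix_link eth0 on an affected machine would apply the workaround only when the address is actually missing. It does not explain why the address was not assigned in the first place.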
Debugging techniques:
- Build systemd with extra logging enabled, scp the systemd-networkd binary to a failing instance, then bind mount it in.
- Since you'll (probably) be debugging a networkd issue over ssh, take care to chain commands that bring down networkd with ones that bring it up. I recommend writing a "bind" and "unbind" script:
# bind.sh, assumes a systemd-networkd binary is at /tmp/systemd-networkd
systemctl stop systemd-networkd
mount -o bind /tmp/systemd-networkd /usr/lib/systemd/systemd-networkd
systemctl start systemd-networkd
# unbind.sh
systemctl stop systemd-networkd
umount /usr/lib/systemd/systemd-networkd
systemctl start systemd-networkd
- Run with the dropin:
# /etc/systemd/system/systemd-networkd.conf
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
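One way to install the drop-in shown above is sketched below. It assumes the standard systemd drop-in directory for the unit (systemd-networkd.service.d under /etc/systemd/system). The function name write_debug_dropin and the file name 10-debug.conf are hypothetical and chosen only for this example. The optional ROOT argument exists so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch only: write_debug_dropin and 10-debug.conf are hypothetical names.

# write_debug_dropin [ROOT]
# Writes a drop-in that sets SYSTEMD_LOG_LEVEL=debug for systemd-networkd.
# ROOT is prepended to the path; leave it empty to write to the real /etc.
write_debug_dropin() {
    root="${1:-}"
    dir="$root/etc/systemd/system/systemd-networkd.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-debug.conf" <<'EOF'
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
EOF
}
```

After writing to the real root, systemctl daemon-reload followed by a restart of systemd-networkd picks up the change, and journalctl -b -u systemd-networkd shows the debug output. As with the bind and unbind scripts, chain the restart so networkd is not left stopped over ssh.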
Other notes:
- I've yet to see it in the initramfs, but that could just be because it's infrequent. This also means that networkd starts in the initramfs just fine, but then fails in the real root.
- Kernel logs have nothing interesting.
Suddenly seeing this recently on latest coreos-stable. Any updates?
I also experienced it in an OpenStack environment where CoreOS was running in one instance and another instance provided DHCP service with dnsmasq (DHCP was turned off on the OS subnet, and I disabled port security to make this setup possible; it is not a production deployment). systemd-networkd-wait-online seems to time out after 2 minutes. I will check whether it is possible to give it a higher timeout. Container Linux (CoreOS) 2023.5.0.
Then it got the IP by manually executing sudo dhcpcd -d -N -w. So it may be some transient problem?
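On the timeout question: systemd-networkd-wait-online accepts a --timeout= option in seconds, and its default of 120 seconds matches the 2 minutes observed. A rough sketch of generating a drop-in that raises it follows. The function name render_wait_online_dropin and the 300 second default are made up for this example, and the binary path is an assumption based on the /usr/lib/systemd directory used in the bind script earlier in this thread.

```shell
#!/bin/sh
# Sketch only: render_wait_online_dropin is a hypothetical helper name.

# render_wait_online_dropin [SECONDS]
# Prints a drop-in for systemd-networkd-wait-online.service that replaces
# ExecStart with a longer timeout (default here: 300, chosen arbitrarily).
render_wait_online_dropin() {
    timeout="${1:-300}"
    case "$timeout" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    # The empty ExecStart= clears the original command before the new one
    # is added. The binary path below is an assumption; adjust as needed.
    printf '%s\n' \
        '[Service]' \
        'ExecStart=' \
        "ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --timeout=$timeout"
}
```

The output would go into a .conf file under /etc/systemd/system/systemd-networkd-wait-online.service.d/, followed by systemctl daemon-reload. A longer timeout only helps if the link eventually becomes configured. In the failure described at the top of this issue, networkctl never shows the link as configured, so it would likely just delay the failure.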