systemd-networkd-wait-online sometimes fails

Open ajeddeloh opened this issue 7 years ago • 3 comments

Issue Report

Bug

Container Linux Version

Current master, probably all versions

Environment

qemu and azure, most likely others as well

Expected Behavior

systemd-networkd-wait-online.service starts successfully

Actual Behavior

Infrequently, it fails to start.

Reproduction Steps

Repeatedly run any kola test that includes something that Requires or Wants the unit until it fails (the docker tests are good examples of things to run)

Other Information

Comparing logs from failed and passed tests, and debugging live on an Azure VM with @arithx, it appears that on failed tests networkd never logs Gained IPv6LL, whereas it does on tests that pass. This is probably a networkd bug, not a systemd-networkd-wait-online.service bug. networkctl never shows the link as configured when the tests fail.

ajeddeloh, Jan 29 '18 21:01

I've been digging, gunna dump what I've found so far:

  • Toggling the link up and down (e.g. with ip link set eth0 down && ip link set eth0 up) fixes the issue
  • Links on machines that fail never get IPv6 addresses (unless the link is toggled)
  • This branch is taken on machines that work but not on machines that do not.
    • Adding some debug printing shows that machines that fail have the IP address :: (the unspecified address), whereas machines that work already have an fe80... (link-local) address assigned.
    • It looks like the kernel should be assigning this when the link comes up but isn't for some reason.
  • Masking systemd-networkd.service and running ip link set eth0 up brings the link up and assigns the IPv6LL address. It's unclear to me at this moment whether the kernel does this automatically when the link comes up or whether ip issues extra commands to do so. My hunch is the former.
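
The symptom and the link-toggle workaround described above can be sketched as a small script. This is only an illustration, not part of the original investigation: the function names has_link_local and fix_link are made up here, eth0 is assumed as the default interface, and matching the literal fe80: prefix is a simplification of the fe80::/10 link-local range.

```shell
#!/bin/sh
# Sketch: detect the missing-IPv6LL symptom and apply the link-toggle workaround.
# Function names are hypothetical; nothing runs until fix_link is called.

# Succeeds if the `ip -o -6 addr show` output on stdin contains an address
# starting with fe80: (a simplification of the fe80::/10 link-local range).
has_link_local() {
    grep -Eq 'inet6 fe80:[0-9a-f:]*/[0-9]+'
}

# Toggle the link only if it has no link-local address. Must run as root.
# Down and up are chained in one command so that an ssh session over this
# link is not left stranded between the two steps.
fix_link() {
    iface="${1:-eth0}"
    if ip -o -6 addr show dev "$iface" scope link | has_link_local; then
        echo "$iface: IPv6LL present"
    else
        echo "$iface: no IPv6LL, toggling link"
        ip link set "$iface" down && ip link set "$iface" up
    fi
}
```

On an affected machine, running fix_link eth0 as root would apply the workaround only when the link-local address is actually missing.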

Debugging techniques:

  • Build systemd with extra logging enabled, scp the systemd-networkd binary to a failing instance, then bind mount it in.
  • Since you'll (probably) be debugging a networkd issue over ssh, take care to chain commands that bring down networkd with ones that bring it up. I recommend writing a "bind" and "unbind" script:
# bind.sh, assumes a systemd-networkd binary is at /tmp/systemd-networkd
systemctl stop systemd-networkd
mount -o bind /tmp/systemd-networkd /usr/lib/systemd/systemd-networkd
systemctl start systemd-networkd
# unbind.sh
systemctl stop systemd-networkd
umount /usr/lib/systemd/systemd-networkd
systemctl start systemd-networkd
  • Run with the dropin:
# /etc/systemd/system/systemd-networkd.service.d/10-debug.conf (the file name is arbitrary)
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
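
A rough sketch of putting the debug drop-in in place and then checking the journal for the Gained IPv6LL message mentioned in the report. It assumes drop-ins for the unit live in a systemd-networkd.service.d directory under /etc/systemd/system; the function names and the 10-debug.conf file name are hypothetical.

```shell
#!/bin/sh
# Sketch: install the debug-logging drop-in and look for the IPv6LL message.
# Function names and the drop-in file name are hypothetical.

# Write the drop-in under ROOT (defaults to the real root filesystem).
write_debug_dropin() {
    root="${1:-}"
    dir="$root/etc/systemd/system/systemd-networkd.service.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nEnvironment=SYSTEMD_LOG_LEVEL=debug\n' > "$dir/10-debug.conf"
}

# Count "Gained IPv6LL" lines in journal text read from stdin.
count_gained_ipv6ll() {
    grep -c 'Gained IPv6LL'
}

# On a real machine, as root:
#   write_debug_dropin
#   systemctl daemon-reload && systemctl restart systemd-networkd
#   journalctl -u systemd-networkd -b --no-pager | count_gained_ipv6ll
```

A count of 0 after the link has come up would match the failing case described in the original report.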

Other notes:

  • I've yet to see it in the initramfs, but that could just be because it's infrequent. This also means that networkd starts in the initramfs just fine, but then fails in the real root.
  • Kernel logs have nothing interesting.

ajeddeloh, Mar 05 '18 20:03

Suddenly seeing this recently on latest coreos-stable. Any updates?

llamahunter, Feb 05 '19 04:02

I also experienced it in an OpenStack environment where CoreOS was running in one instance and another instance provided DHCP service with dnsmasq (DHCP was turned off on the OS subnet, and I disabled port security to make this setup possible; it is not a production deployment). systemd-networkd-wait-online seems to time out after 2 minutes. I will check whether it is possible to give it a higher timeout. Container Linux (CoreOS) 2023.5.0. It then got the IP when I manually executed sudo dhcpcd -d -N -w. So it may be some transient problem?
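
For reference, one possible way to raise the timeout is a drop-in that overrides ExecStart with the --timeout option of systemd-networkd-wait-online (the default is 120 seconds, which matches the 2 minutes seen here). This is a sketch under assumptions: the binary path is assumed to sit next to the systemd-networkd binary mentioned earlier in this thread, and the function name and the 10-timeout.conf file name are hypothetical. It would only help if the link does eventually become configured.

```shell
#!/bin/sh
# Sketch: raise the systemd-networkd-wait-online timeout via a drop-in.
# The function name, file name, and binary path are assumptions.

# Usage: write_timeout_dropin SECONDS [ROOT]
write_timeout_dropin() {
    secs="$1"
    root="${2:-}"
    dir="$root/etc/systemd/system/systemd-networkd-wait-online.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-timeout.conf" <<EOF
[Service]
# An empty ExecStart= clears the packaged command before replacing it.
ExecStart=
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --timeout=$secs
EOF
}

# On a real machine, as root:
#   write_timeout_dropin 300
#   systemctl daemon-reload
```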

attila123, Mar 26 '19 17:03