Retrigger udev on failure to get device serial
What happened: We implicitly rely on udev to gather the serial ID of devices as they are added to the guest. If udev fails, for example due to transient networking issues on the underlying data path, it does not retry. We are therefore stuck with failed mounts.
What you expected to happen: NodeStageVolume can retrigger udev if it cannot find the serial ID, allowing us to eventually succeed once the underlying networking issues are resolved.
How to reproduce it (as minimally and precisely as possible):
You can reproduce this in a slightly contrived way by adding a udev rule that forces udev to time out, simulating a command that hangs due to a networking issue. I did this by adding the following line to /usr/lib/udev/rules.d/60-persistent-storage.rules:
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/bin/sleep 600"
After creating a pod and pvc on this tenant node, you can see mount errors in the pod events:
Warning FailedMount 10s (x6 over 27s) kubelet MountVolume.MountDevice failed for volume "pvc-0966c060-98fc-4037-93f9-3833b5874e98" : rpc error: code = Unknown desc = couldn't find device by serial id
and the udev logs show that it gives up and kills the workers:
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' is taking a long time
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' is taking a long time
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda'
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd'
Checking back periodically, I see that it does not retry. This is expected, since udev is event driven and no new event arrives for the device.
Environment: Looking at the code, I think this should happen in most environments and versions.
Just so I understand properly, this problem happens in the tenant? And we can reproduce by adding that rule to udev in the tenant?
Yes, this problem is in the tenant. Adding that udev rule to tenant nodes will allow you to reproduce it, though depending on your guest OS I'd imagine the specific file path could be different.
Okay, yeah, I believe our tenant OS in the CI lanes is Ubuntu. I am going on PTO starting tomorrow, so it might be a bit until I get around to looking at this.
Thanks. I'm happy to contribute a fix here.
If you have a fix, please submit a PR and I can take a look. Again, that will be after I get back from PTO.