csi-driver

Retrigger udev on failure to get device serial

Open ryanpgoogle opened this issue 2 months ago • 5 comments

What happened: We implicitly rely on udev to gather the serial ID of devices as they are added to the guest. If udev fails, for example due to transient networking issues on the underlying data path, it does not retry, so we are stuck with failed mounts.

What you expected to happen: NodeStageVolume should retrigger udev if it cannot find the serial ID, allowing the mount to eventually succeed once the underlying networking issues are resolved.

How to reproduce it (as minimally and precisely as possible): You can reproduce this in a slightly contrived way by adding a udev rule that forces udev to time out, simulating a command that hangs due to a networking issue. I did this by adding the following line to /usr/lib/udev/rules.d/60-persistent-storage.rules:

KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/bin/sleep 600"

After creating a pod and PVC on this tenant node, you can see mount errors in the pod events:

Warning  FailedMount             10s (x6 over 27s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-0966c060-98fc-4037-93f9-3833b5874e98" : rpc error: code = Unknown desc = couldn't find device by serial id

and the udev logs show that it kills the stuck workers and gives up:

Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' is taking a long time
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' is taking a long time
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda'
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd'

Checking back periodically, I see that udev does not retry. This is expected, since udev is event driven and no new event arrives for the device.

Environment: Looking at the code, I think this should happen in most environments and versions.

ryanpgoogle • Oct 03 '25 17:10

Just so I understand properly, this problem happens in the tenant? And we can reproduce by adding that rule to udev in the tenant?

awels • Oct 03 '25 17:10

Yes, this problem is in the tenant. Adding that udev rule to the tenant nodes will let you reproduce it, though the specific file path may differ depending on your guest OS.

ryanpgoogle • Oct 03 '25 17:10

Okay, yeah, I believe our tenant OS in the CI lanes is Ubuntu. I am going on PTO starting tomorrow, so it might be a bit until I get around to looking at this.

awels • Oct 03 '25 18:10

Thanks. I'm happy to contribute a fix here.

ryanpgoogle • Oct 03 '25 18:10

If you have a fix, please submit a PR and I can take a look, though again only after I get back from PTO.

awels • Oct 03 '25 18:10