gpu-operator

Driver validation doesn't succeed because /usr/bin is a symlink

Open wokalski opened this issue 9 months ago • 6 comments

On my OS (NixOS) /usr/bin is a symlink to /run/current-system/sw/bin

[root@nvidia-operator-validator-9nk4h /]# ls -lah /host/usr/bin
lrwxrwxrwx 1 root root 26 Feb 16 13:09 /host/usr/bin -> /run/current-system/sw/bin

This means that the validation will always fail here: https://github.com/NVIDIA/gpu-operator/blob/e9b7ec16203987508eddc048fe82a4c44d3b392a/validator/main.go#L703C1-L704C1

If I'm not mistaken we should simply call syscall.Chroot there, but I'm not sure if the following symlink creation would still work.

wokalski avatar Mar 23 '25 20:03 wokalski

@wokalski what is the behavior you are observing? From reading https://pkg.go.dev/os#Lstat, my understanding is that the os.Lstat() call would not return an error here.

cdesiniotis avatar Mar 28 '25 20:03 cdesiniotis

The pod has host fs mounted under /host so a symbolic link that resolves to /foo/bar doesn't work because the real location is /host/foo/bar.

wokalski avatar Mar 28 '25 20:03 wokalski
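The failure described in the comment above can be reproduced without a cluster. The following is a minimal sketch, assuming a Linux-like system where unprivileged symlink creation works. It builds a throwaway directory that plays the role of the /host mount and gives it a usr/bin symlink with an absolute target. The function name `lstatThroughHostMount` and the target path `/gr-demo-host-only/sw/bin` are made up for illustration (the latter stands in for /run/current-system/sw/bin so the demo does not depend on what exists on the machine running it). It is not taken from the validator code.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hostOnlyTarget is a hypothetical absolute path that exists only under the
// fake host root, in the same way /run/current-system/sw/bin exists only
// under /host inside the validator pod.
const hostOnlyTarget = "/gr-demo-host-only/sw/bin"

// lstatThroughHostMount reports whether looking up nvidia-smi through the
// usr/bin symlink fails with "not exist", and whether looking it up via the
// real location under the fake host root succeeds.
func lstatThroughHostMount() (bool, bool, error) {
	tmp, err := os.MkdirTemp("", "hostmount")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(tmp)

	host := filepath.Join(tmp, "host")
	realBin := filepath.Join(host, hostOnlyTarget)
	if err := os.MkdirAll(realBin, 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(realBin, "nvidia-smi"), nil, 0o755); err != nil {
		return false, false, err
	}
	if err := os.MkdirAll(filepath.Join(host, "usr"), 0o755); err != nil {
		return false, false, err
	}
	// Absolute target: it is resolved against the process root, not against host.
	if err := os.Symlink(hostOnlyTarget, filepath.Join(host, "usr", "bin")); err != nil {
		return false, false, err
	}

	// Lstat only declines to follow a symlink in the final path component.
	// usr/bin is an intermediate component here, so it is followed.
	_, linkErr := os.Lstat(filepath.Join(host, "usr", "bin", "nvidia-smi"))
	_, realErr := os.Lstat(filepath.Join(realBin, "nvidia-smi"))

	return errors.Is(linkErr, fs.ErrNotExist), realErr == nil, nil
}

func main() {
	missing, found, err := lstatThroughHostMount()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("lookup through usr/bin symlink reports not-exist:", missing)
	fmt.Println("lookup via real path under host root succeeds:", found)
}
```

The file is present under the fake host root, but the lookup through usr/bin leaves that root as soon as the absolute link target is followed, which matches the behavior reported for /host/usr/bin.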

What are the validator pod logs? Based on the below from https://pkg.go.dev/os#Lstat, I was not expecting the link to be followed:

Lstat makes no attempt to follow the link

cdesiniotis avatar Mar 28 '25 20:03 cdesiniotis

There are no interesting logs cuz it's running silently. Notice that /usr/bin is a symlink and it's lstating /usr/bin/nvidia-smi, so it will never find it if the link is not followed 😅. Right, my initial description was confusing, but the two things are interconnected.

And if it were followed, the file would not be found because the link would point to a non-existent path.

wokalski avatar Mar 28 '25 20:03 wokalski

Got it, thanks for the further clarification. The fact that os.Lstat does not follow symlinks means that it will think /host/usr/bin/nvidia-smi does not exist.

As you may already know, NixOS is not a supported OS. It may still be valuable to re-evaluate the validator code and see if it can be tweaked to work on a system where /usr/bin is a symlink. I am not familiar with NixOS so I am not sure if there are other issues that would be encountered besides this.

cdesiniotis avatar Mar 28 '25 20:03 cdesiniotis
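One possible direction for such a tweak, short of chroot, is to resolve the path component by component and treat absolute symlink targets as relative to the host mount. The sketch below is an illustration only, not the validator's code and not a proposed patch: `resolveInRoot`, `demoResolve`, and `maxSymlinkHops` are hypothetical names, and the demo tree only imitates the NixOS layout from the report. It also does no hardening against a host filesystem that changes during resolution.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

const maxSymlinkHops = 40 // arbitrary loop guard for this sketch

// resolveInRoot (hypothetical helper) resolves name as if root were "/",
// so absolute symlink targets are looked up under root.
func resolveInRoot(root, name string) (string, error) {
	pending := strings.Split(filepath.ToSlash(name), "/")
	resolved := "." // symlink-free path relative to root
	hops := 0
	for len(pending) > 0 {
		part := pending[0]
		pending = pending[1:]
		switch part {
		case "", ".":
			continue
		case "..":
			resolved = filepath.Dir(resolved) // stays at "." once at the root
			continue
		}
		candidate := filepath.Join(resolved, part)
		full := filepath.Join(root, candidate)
		// Everything before the last component is already symlink-free,
		// so Lstat inspects exactly this component.
		info, err := os.Lstat(full)
		if err != nil {
			return "", err
		}
		if info.Mode()&os.ModeSymlink == 0 {
			resolved = candidate
			continue
		}
		hops++
		if hops > maxSymlinkHops {
			return "", fmt.Errorf("too many symlinks resolving %q", name)
		}
		target, err := os.Readlink(full)
		if err != nil {
			return "", err
		}
		if filepath.IsAbs(target) {
			resolved = "." // absolute targets restart from root, not from "/"
		}
		pending = append(strings.Split(filepath.ToSlash(target), "/"), pending...)
	}
	return filepath.Join(root, resolved), nil
}

// demoResolve builds a fake host root where usr/bin points at the absolute
// path /run/current-system/sw/bin, then resolves /usr/bin/nvidia-smi in it.
// It returns the resolved path and the path where the file really lives.
func demoResolve() (string, string, error) {
	root, err := os.MkdirTemp("", "hostroot")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	realBin := filepath.Join(root, "run", "current-system", "sw", "bin")
	want := filepath.Join(realBin, "nvidia-smi")
	if err := os.MkdirAll(realBin, 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(want, nil, 0o755); err != nil {
		return "", "", err
	}
	if err := os.MkdirAll(filepath.Join(root, "usr"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.Symlink("/run/current-system/sw/bin", filepath.Join(root, "usr", "bin")); err != nil {
		return "", "", err
	}
	got, err := resolveInRoot(root, "/usr/bin/nvidia-smi")
	return got, want, err
}

func main() {
	got, want, err := demoResolve()
	fmt.Println(got == want, err)
}
```

In the validator's setting, root would correspond to /host. Whether the subsequent symlink creation mentioned in the original report still works with a path resolved this way is left open here, as it is in the thread.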

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]