
Toolkit DaemonSet stuck in init phase after upgrade

Open heilerich opened this issue 2 years ago • 7 comments

Symptoms

After upgrading the operator to v23.6.0, all deployments are stuck because the nvidia-container-toolkit-daemonset DaemonSet stays in the Init:0/1 phase indefinitely. The logs of the init stage indicate that the driver-validation container tries to run modprobe for the nvidia kernel module, which fails because the container does not ship with the kernel modules (they are loaded by the driver DaemonSet).
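For reference, this is roughly how we observed it (the namespace and pod name are examples from our setup; adjust to your installation):

kubectl -n gpu-operator get pods | grep nvidia-container-toolkit-daemonset
# pods stay in Init:0/1
kubectl -n gpu-operator logs nvidia-container-toolkit-daemonset-xxxxx -c driver-validation
# shows the failed modprobe from the driver-validation init container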

Issue

I believe the issue is caused by the fixes for #430 / #485 introduced in 84ef9b3. The new module for creating the char devices also has a function to load the kernel modules. With 8906259 (ping @elezar) this is explicitly activated here:

https://github.com/NVIDIA/gpu-operator/blob/25d6f8d0a6f12b020b0bbc4d60ab95ccab77f5d0/validator/main.go#L717

As mentioned above, I believe this cannot work in the validation container (unless I am missing something), and I also do not understand why it would be necessary, since loading the modules is handled by the driver DaemonSet.

Proposed solution

I think the code above in the validator's main.go should be set to devchar.WithLoadKernelModules(false). We have deployed a copy of the v23.6.0 container with this patch in our environment and everything seems to work fine.

diff --git a/validator/main.go b/validator/main.go
index 4742834d..9fee8103 100644
--- a/validator/main.go
+++ b/validator/main.go
@@ -714,7 +714,7 @@ func createDevCharSymlinks(isHostDriver bool, driverRoot string) error {
                devchar.WithDevCharPath(hostDevCharPath),
                devchar.WithCreateAll(true),
                devchar.WithCreateDeviceNodes(true),
-               devchar.WithLoadKernelModules(true),
+               devchar.WithLoadKernelModules(false),
        )

Related

Issue #552 might also be caused by this. There, v23.3.2 is used, which does not set devchar.WithLoadKernelModules explicitly, but I think the default value in that version might be true. I have not verified this, since we have never used v23.3.2.

heilerich avatar Aug 09 '23 18:08 heilerich

@cdesiniotis @shivamerla any thoughts on this? Would exposing this as a config option (possibly changing the default back to the previous value) make sense?

@heilerich do you have logs available that show the failure to load the kernel modules?

elezar avatar Aug 09 '23 20:08 elezar

The log output looks like this:

time="2023-08-09T20:58:41Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-09T20:58:41Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.122-flatcar\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

heilerich avatar Aug 09 '23 20:08 heilerich

We set the driverRoot to either /host or /run/nvidia/driver, which the nvidia-ctk library should chroot into to find the nvidia modules, so I am wondering why it is not able to find them.
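Conceptually, what the validator ends up doing via the library is roughly equivalent to the following (the exact invocation differs; this is only to illustrate the chroot):

# driverRoot is /host (pre-installed driver) or /run/nvidia/driver (driver container)
chroot /run/nvidia/driver modprobe nvidia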

@heilerich this was added for cases with a pre-installed driver, where all necessary modules might not be loaded when a GPU pod is scheduled or after a node reboot. With the driver container, we can skip this.

@elezar yes, it makes sense to add an option to disable this.

shivamerla avatar Aug 10 '23 02:08 shivamerla

@heilerich just to confirm in your case driver is pre-installed on the node with Flatcar Linux?

shivamerla avatar Aug 10 '23 02:08 shivamerla

@heilerich just to confirm in your case driver is pre-installed on the node with Flatcar Linux?

No, the driver is installed using the driver container.

we set the driverRoot to either /host or /run/nvidia/driver which the nvidia-ctk library should chroot to and find nvidia modules, wondering why it is not able to find them.

Ah, I totally missed the chroot. That must be it. The Flatcar driver container does not copy the modules to /lib/modules in its root filesystem, but loads them from /opt/nvidia/${DRIVER_VERSION} using modprobe -b. That is why the modprobe fails even when chrooting into the driver root. So basically this is a mismatch between the expectation of how the (maintained) driver containers should look and how the Flatcar driver container actually works (I realise the Flatcar container is not officially supported). So I think we can fix this by patching the driver container, of which we already maintain a fork because of various other issues.
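For the record, the rough idea of the driver container patch (a sketch only, untested as written; the module locations under /opt/nvidia/${DRIVER_VERSION} are specific to our Flatcar fork) is to make the modules resolvable under /lib/modules so a chrooted modprobe can find them:

# sketch only; adjust paths to the actual layout of the driver container
KERNEL_VERSION=$(uname -r)
mkdir -p /lib/modules/${KERNEL_VERSION}/kernel/drivers/video
find /opt/nvidia/${DRIVER_VERSION} -name '*.ko' -exec cp {} /lib/modules/${KERNEL_VERSION}/kernel/drivers/video/ \;
depmod ${KERNEL_VERSION}   # regenerate modules.dep so modprobe can resolve the nvidia modules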

@elezar yes, it makes sense to add an option to disable this.

I guess we can close this issue, unless you still want to add said option. I still think it would make sense; at least I would use it, since this step is clearly unnecessary in our environment and a potential source of problems.

I also want to note that there does not seem to be an option to change the log level for the validator command (unless I missed something again). That would also have been helpful here :-)

Anyway, I appreciate the quick response.

heilerich avatar Aug 10 '23 08:08 heilerich

@heilerich we will consider these enhancements. Thanks

shivamerla avatar Aug 29 '23 06:08 shivamerla

Any plans to take this up? We have a similar requirement, namely being able to disable some of the validation init containers.

chiragjn avatar Jan 13 '24 13:01 chiragjn