open-gpu-kernel-modules

Use device_create to ensure /dev nodes are created correctly.

Open arnej27959 opened this issue 2 years ago • 2 comments

After following the installation instructions for CUDA on RHEL 8.8, I ran into problems later on. Debugging with system call tracing showed that some of the device nodes, such as /dev/nvidia-uvm and /dev/nvidiactl, did not exist. There are tips in https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#device-node-verification for how to fix this manually, but that should not really be necessary. Currently there are rules in /usr/lib/udev/rules.d/60-nvidia.rules which create these nodes using "mknod", but "journalctl" showed that they fail randomly:

Aug 16 09:35:34 gpu-test-arnej-1 sudo[6801]: arnej_yahooinc_com : TTY=pts/0 ; PWD=/home/arnej_yahooinc_com ; USER=root ; COMMAND=/bin/nvidia-modprobe
Aug 16 09:35:34 gpu-test-arnej-1 sudo[6801]: pam_unix(sudo:session): session opened for user root by arnej_yahooinc_com(uid=0)
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia: module license 'NVIDIA' taints kernel.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: Disabling lock debugging due to kernel taint
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[614]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 240
Aug 16 09:35:35 gpu-test-arnej-1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.86.10  Wed Jul 26 23:20:03 UTC 2023
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6804]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6812]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6804]: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia-uvm: Loaded the UVM driver, major device number 238.
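
The failing rules in the log derive the major number with `grep nvidia-frontend /proc/devices | cut -d \  -f 1`. The sketch below mirrors that lookup in userspace C to make explicit what the rule depends on: if the entry is not present at the moment the rule runs, the lookup yields nothing and `mknod` is invoked without a valid major number. Whether this is the actual cause of the failures above is an assumption, not something the log confirms. The function name `find_major` is hypothetical, and unlike `grep` it matches the driver name exactly rather than as a substring.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the major number listed for `name` in /proc/devices-style text,
 * or -1 if there is no such entry. Hypothetical helper for illustration. */
int find_major(const char *devices_text, const char *name)
{
    assert(devices_text != NULL && name != NULL);

    const char *line = devices_text;
    while (*line != '\0') {
        int major;
        char entry[64];

        /* Entries look like "240 nvidia-nvlink"; section headers such as
         * "Character devices:" do not start with a number and are skipped. */
        if (sscanf(line, "%d %63s", &major, entry) == 2 &&
            strcmp(entry, name) == 0)
            return major;

        const char *next = strchr(line, '\n');
        if (next == NULL)
            break;
        line = next + 1;
    }
    return -1; /* the shell pipeline would produce an empty string here */
}
```

A caller would read /proc/devices into a buffer and pass it in; a return value of -1 corresponds to the case where the shell command substitution in the udev rule expands to nothing.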

Best practice, however, is for the device driver to trigger node creation directly using the device_create() kernel function; using "mknod" in udev rules is not the usual way to solve this.

This PR takes care of calling device_create() as needed, and device_destroy() to clean up when a module or device is detached. I tested it with the udev rules disabled, using "rmmod" and "modprobe" to unload and load the modules, and of course also checked that it works on reboot.

arnej27959 Aug 16 '23 14:08

The original justification for not using device_create() is probably that it is marked as EXPORT_SYMBOL_GPL and as such can't be used in a proprietary module. That is not a problem in the open kernel module, but it adds an additional difference between the two.

kanashimia Oct 21 '23 00:10

CLA assistant check
All committers have signed the CLA.

CLAassistant Jun 06 '24 06:06