CUDA_ERROR_SYSTEM_DRIVER_MISMATCH
Hello. We have a K8s node with 8 H200 GPUs on RHEL 9.4. The CUDA version is 12.8 and the driver version is 570.133.20.
We have configured GPU operator v25.3.0. All daemons are working fine.
We have three pods running with different versions of cuda-python installed: 12.5.1, 12.6.1, and 12.8.1.
When we run the same test program in all of these pods, only the pod with cuda-python 12.8.1 completes successfully. The other two fail with a CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error.
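For reference, the test is just a minimal driver-API smoke test. A rough sketch of that kind of check with the cuda-python low-level driver bindings (not our exact program) looks like this:

```python
# Minimal repro sketch (not the original test program): initialize the driver
# and print what the container actually sees.
from cuda import cuda  # cuda-python low-level driver bindings


def check(err):
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver call failed: {err}")


# cuInit is where CUDA_ERROR_SYSTEM_DRIVER_MISMATCH surfaces when the
# user-space libcuda does not match the loaded kernel module.
err, = cuda.cuInit(0)
check(err)

err, version = cuda.cuDriverGetVersion()
check(err)
print(f"driver API version: {version // 1000}.{(version % 1000) // 10}")

err, count = cuda.cuDeviceGetCount()
check(err)
print(f"visible GPUs: {count}")
```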
I have chosen the option driver.enabled=true so that containers do not rely on what is installed on the system.
I have gone through the RKE2 NVIDIA Operator documentation and everything seems alright. I can see the label nvidia.com/gpu.deploy.driver=pre-installed on the node, so the operator sees the pre-existing driver.
So, what should I be doing to successfully run different versions of cuda-python on the same node? This problem does not exist on AWS EKS, so what am I missing when configuring the on-prem nodes?
Looking at https://github.com/NVIDIA/gpu-operator/issues/126, I can see that changes have been made to use pre-existing drivers on the host. Can the operator be forced to ignore the pre-existing driver on the host and use a driver version specified in the Helm chart?
The GPU Operator install guide says the following:
If you do not specify the driver.enabled=false argument and nodes in the cluster have a pre-installed GPU driver, the init container in the driver pod detects that the driver is preinstalled and labels the node so that the driver pod is terminated and does not get re-scheduled on to the node. The Operator proceeds to start other pods, such as the container toolkit pod.
What happens if we specify driver.enabled=true even though a host driver exists?
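For context, this is roughly how we set it. A sketch of the Helm invocation (the release name, namespace, and version string are only examples; driver.enabled and driver.version are standard gpu-operator chart values):

```sh
# Sketch of the Helm invocation used to request an operator-managed driver.
# Release name, namespace, and the driver version string are examples only.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.enabled=true \
  --set driver.version=570.133.20
```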
If we check the LD_DEBUG output when running NVIDIA GPU commands, the loader picks up libcuda.so.560.x (the driver user-space library) from /usr/local/cuda/compat, which is the default shipped inside the NVIDIA image. But the library matching the installed driver (libcuda.so.570.133.20) is in /usr/lib/cuda/lib64. This version mismatch is the reason for the CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error.
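A quick way to confirm which copy the loader actually resolves (a sketch, using the paths from above and assuming python3 and cuda-python are available in the pod):

```sh
# Show which libcuda the dynamic linker resolves when cuda-python initializes
# the driver; LD_DEBUG output goes to stderr, hence 2>&1.
LD_DEBUG=libs python3 -c "from cuda import cuda; cuda.cuInit(0)" 2>&1 | grep libcuda

# Compare the copies present on the two paths observed above.
ls -l /usr/local/cuda/compat/libcuda* /usr/lib/cuda/lib64/libcuda*
```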
This issue is due to the following:
In the Triton Inference Server image, NVIDIA sets the LD_LIBRARY_PATH environment variable to prefer /usr/local/cuda/compat over /usr/lib/cuda/lib64. This works with lower driver versions thanks to backward compatibility, because /usr/local/cuda/compat contains libcuda.so.560.x, but with the 570.x driver it fails.
Remediation: set LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH. This gives preference to the libraries matching the installed driver. If you are using Kubernetes, set this in the Triton server's environment variables.
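A rough sketch of what that looks like in a pod spec (container name and image tag are placeholders; note that Kubernetes env values do not inherit the image's own LD_LIBRARY_PATH, so any paths the image itself needs must be listed explicitly):

```yaml
# Pod spec fragment: prepend the host-driver library path for the Triton container.
# Container name and image tag are placeholders.
spec:
  containers:
  - name: triton-server
    image: nvcr.io/nvidia/tritonserver:<tag>
    env:
    - name: LD_LIBRARY_PATH
      # The path matching the installed 570.x driver comes first so it wins
      # over the compat libcuda baked into the image.
      value: "/usr/lib/cuda/lib64:/usr/local/cuda/lib64"
```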