
Missing LIBDEVICE Folder with Nvidia-gpu Operator 23.9.2

Open rrajkumar1990 opened this issue 1 year ago • 5 comments

  1. Quick Debug Information
  • OS/Version: Red Hat Enterprise Linux 8.9
  • Kernel Version: 4
  • Red Hat OpenShift Version: 4
  • Nvidia-gpu Operator Version: 23.9.2
  • Application Details:
    • Python Version: 3.11
    • Tensorflow Version: 2.15
    • Cuda Version (inside application pod): 12.2
    • CUDNN Version (inside application pod): 8.9.7.29
  2. Issue Description: We are currently encountering an issue with the Nvidia-gpu operator version 23.9.2 installed on our Red Hat OpenShift cluster. Our application, running on Python 3.11 with Tensorflow 2.15, is unable to mount the GPU due to an error related to the absence of library(s) found in "LIBDEVICE" folder (the LIBDEVICE folder is missing).
  • Cluster Policy Check:
    • "Toolkit-enabled" is set to true. (screenshots: nvidiagpu-operator, nvidia-gpu-cluster-policy)
  • Nvidia-gpu Operator Pods (screenshot: nvidia-pods)

  3. Error Description: Upon running the application, Tensorflow attempts to use the GPU and emits a warning stating that the "LIBDEVICE" folder is not available.

0%| | 0/48 [00:00<?, ?it/s]
2024-02-28 22:55:08.680356: W external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:504] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice. Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-12.2
  /usr/local/cuda
  /app/.venv/lib64/python3.11/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /app/.venv/lib64/python3.11/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc

(screenshot: tensorflow_logs)
  4. LIBDEVICE Folder Investigation:

    • Tensorflow is looking for the "LIBDEVICE" folder inside the Cuda path.
    • Manually checking the Cuda 12.2 location inside the application pod shows no "LIBDEVICE" directory. (screenshot: app_pod_cuda_folder)
  5. Historical Context:

    • No such error was encountered with Nvidia-gpu operator version 23.3.2, Cuda 11.8, and Tensorflow 2.13.

  6. Expected Behavior: We expect that, with Nvidia-gpu operator version 23.9.2 installed and the toolkit enabled, our application should not encounter any errors from Tensorflow when attempting to utilize the GPU.

  7. Impact: This issue is hindering the proper functioning of our application, preventing it from taking advantage of the GPU resources.

  8. Request: We kindly request your assistance in investigating and resolving this issue. If necessary, we are available to provide additional information or perform further testing to assist in the debugging process. Thank you for your prompt attention to this matter. We appreciate your efforts in ensuring the smooth operation of our GPU-enabled environment.
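
The LIBDEVICE folder investigation above was done by hand. A minimal sketch of the same check, runnable inside the application pod, might look like the following. The helper name `find_libdevice` is hypothetical, the directory list is copied from the warning (the two long site-packages entries are left out for brevity), and the sketch assumes libdevice ships as a `libdevice*.bc` bitcode file such as `libdevice.10.bc`.

```python
from pathlib import Path

# Directories taken from the "Searched for CUDA in" list in the warning.
# The pip-installed nvidia/cuda_nvcc paths are omitted here; add them if needed.
SEARCHED_DIRS = [
    "./cuda_sdk_lib",
    "/usr/local/cuda-12.2",
    "/usr/local/cuda",
]


def find_libdevice(cuda_dirs):
    """Return the first <dir>/nvvm/libdevice holding a libdevice*.bc file, else None.

    Hypothetical helper for illustration; it only inspects the filesystem.
    """
    for cuda_dir in cuda_dirs:
        candidate = Path(cuda_dir) / "nvvm" / "libdevice"
        # Assumption: a usable libdevice directory contains at least one .bc file.
        if candidate.is_dir() and any(candidate.glob("libdevice*.bc")):
            return candidate
    return None


if __name__ == "__main__":
    found = find_libdevice(SEARCHED_DIRS)
    if found:
        print(f"libdevice found at {found}")
    else:
        print("libdevice not found in any searched directory")
```

If this reports nothing found, the CUDA installation in the container image most likely lacks the NVVM component that provides libdevice.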

rrajkumar1990 avatar Mar 08 '24 06:03 rrajkumar1990

> unable to mount the GPU due to an error related to the absence of library(s) found in "LIBDEVICE" folder. (LIBDEVICE folder is missing )

What does running nvidia-smi from within the container show? The error message seems to indicate that dependencies are missing from your container image. Have you tried the recommendations from https://github.com/tensorflow/tensorflow/issues/58681#issuecomment-1485198011?
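
A workaround often used for this warning is to point XLA at a CUDA directory that does contain `nvvm/libdevice`, through the `--xla_gpu_cuda_data_dir` option in the `XLA_FLAGS` environment variable. The sketch below is not taken from the linked comment. The helper name `add_cuda_data_dir` and the path `/opt/cuda` are hypothetical placeholders, and the variable has to be set before Tensorflow is imported.

```python
import os


def add_cuda_data_dir(env, cuda_dir):
    """Return a copy of env whose XLA_FLAGS sets --xla_gpu_cuda_data_dir=cuda_dir.

    Hypothetical helper: other flags already in XLA_FLAGS are kept, and any
    earlier --xla_gpu_cuda_data_dir value is replaced.
    """
    prefix = "--xla_gpu_cuda_data_dir="
    updated = dict(env)
    kept = [f for f in updated.get("XLA_FLAGS", "").split() if not f.startswith(prefix)]
    updated["XLA_FLAGS"] = " ".join(kept + [prefix + str(cuda_dir)])
    return updated


if __name__ == "__main__":
    # Assumption: /opt/cuda is a placeholder; use a directory that really
    # contains nvvm/libdevice in your image.
    os.environ["XLA_FLAGS"] = add_cuda_data_dir(os.environ, "/opt/cuda")["XLA_FLAGS"]
    print(os.environ["XLA_FLAGS"])
    # import tensorflow as tf  # import only after XLA_FLAGS has been set
```

This only helps when libdevice exists somewhere in the image; it does not supply the missing files.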

cdesiniotis avatar Mar 08 '24 18:03 cdesiniotis

"nvidia-smi" shows the device normally, so there is no issue there. Also, since this is an operator installation, shouldn't the Nvidia-gpu operator come with all the dependencies? If you check my screenshot of the Nvidia-gpu operator pods, the toolkit is already enabled. Shouldn't that mean NVIDIA provides all the dependencies along with it?

rrajkumar1990 avatar Mar 09 '24 03:03 rrajkumar1990

  1. We plan to ship this code (in an OpenShift environment) assuming the NVIDIA operators are there to set up the environment. The proposed solution, https://github.com/tensorflow/tensorflow/issues/58681#issuecomment-1485198011, requires additional installation steps by the customer. This is not a feasible option because we expect the NVIDIA GPU operator to contain all the CUDA libraries.

  2. Is there any other operator available from NVIDIA that contains all the CUDA libraries? If Tensorflow > 2.10 requires libdevice and does not ship it, is NVIDIA planning to include this library in NVIDIA GPU operator installation ?

rrajkumar1990 avatar Mar 13 '24 03:03 rrajkumar1990

Hi @cdesiniotis, I wanted to check whether you have had a chance to look at https://github.com/NVIDIA/gpu-operator/issues/676#issuecomment-1993333506

rrajkumar1990 avatar Mar 19 '24 06:03 rrajkumar1990

> is NVIDIA planning to include this library in NVIDIA GPU operator installation ?

No. The GPU Operator installs the NVIDIA GPU kernel driver, which consists of the kernel module and user-space driver libraries. The GPU Operator does not install the CUDA toolkit and runtime. The CUDA toolkit, runtime, and all other application dependencies are expected to be included in application container images.

cdesiniotis avatar Apr 01 '24 22:04 cdesiniotis