
[BUG] TensorFlow cannot find the GPU with "docker-compose up", but it works with "docker run"; nvidia-smi shows the GPU with a CUDA version error

Open · azsrz opened this issue 1 year ago · 0 comments

Thanks for helping us improve this project!

Before you create this issue, please make sure you are using the latest code and libraries: see https://github.com/ageron/handson-ml3/blob/main/INSTALL.md#update-this-project-and-its-libraries

Also, please make sure to read the FAQ (https://github.com/ageron/handson-ml3#faq) and search for existing issues (both open and closed), as your question may already have been answered: https://github.com/ageron/handson-ml3/issues

Describe the bug: TensorFlow cannot find the GPU when the container is started with `docker-compose up`, but it works with `docker run`.

To reproduce: run `make run`, then `make exec ipython`.

Then I get the following output:

```text
In [1]: import tensorflow as tf
2023-04-30 20:38:26.804632: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-30 20:38:27.019561: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

In [2]: tf.config.list_physical_devices('GPU')
2023-04-30 20:38:30.584296: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (34)
2023-04-30 20:38:30.584411: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 3b0aa2bc90a6
2023-04-30 20:38:30.584443: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 3b0aa2bc90a6
2023-04-30 20:38:30.584559: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-04-30 20:38:30.584654: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 530.30.2
Out[2]: []
```
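The line `was unable to find libcuda.so DSO loaded into this program` suggests that the CUDA driver library is simply not present inside the container. A minimal check using only the Python standard library is to ask the dynamic loader for it directly. `libcuda_status` is a hypothetical helper, not part of handson-ml3; running it in both the `docker-compose` container and the `docker run` container and comparing the results would show whether the two setups differ at that level:

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def libcuda_status() -> Tuple[Optional[str], bool]:
    """Report whether the CUDA driver library is visible to this process.

    Returns (name found via the linker cache or None, whether loading succeeded).
    """
    # find_library may return None in slim containers even when the file exists
    found = ctypes.util.find_library("cuda")
    try:
        # libcuda.so.1 is the soname installed by the NVIDIA driver
        ctypes.CDLL("libcuda.so.1")
        loaded = True
    except OSError:
        loaded = False
    return found, loaded


if __name__ == "__main__":
    name, ok = libcuda_status()
    status = "ok" if ok else "failed"
    print(f"find_library('cuda') -> {name!r}, load 'libcuda.so.1' -> {status}")
```

If loading fails in the `docker-compose` container but succeeds in the `docker run` one, the difference most likely lies in what the NVIDIA container runtime mounts rather than in TensorFlow itself.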

The GPU is exposed to the container, as `nvidia-smi` confirms, but `nvidia-smi` also reports a CUDA version error (`CUDA Version: ERR!`):

```text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: ERR!     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:81:00.0 Off |                  N/A |
| 30%   36C    P8               25W / 350W|      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
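As far as I understand, `nvidia-smi` takes the `CUDA Version` field from the CUDA driver library, so `ERR!` appears to be consistent with the missing `libcuda.so` reported by TensorFlow (my reading, not confirmed). To compare containers quickly, a small sketch can pull the driver and CUDA fields out of the `nvidia-smi` header. `parse_smi_header` and `live_smi_header` are hypothetical helpers, and the regex assumes the header layout shown above:

```python
import re
import shutil
import subprocess
from typing import Dict, Optional

# Header line copied from the nvidia-smi output above
SAMPLE_HEADER = (
    "| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: ERR!     |"
)


def parse_smi_header(line: str) -> Dict[str, Optional[str]]:
    """Extract the driver and CUDA version fields from an nvidia-smi header line."""
    driver = re.search(r"Driver Version:\s*(\S+)", line)
    cuda = re.search(r"CUDA Version:\s*(\S+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


def live_smi_header() -> Optional[str]:
    """Return the header line of a real nvidia-smi run, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=False)
    for line in result.stdout.splitlines():
        if "CUDA Version" in line:
            return line
    return None


if __name__ == "__main__":
    # Fall back to the sample header when nvidia-smi is not installed
    header = live_smi_header() or SAMPLE_HEADER
    info = parse_smi_header(header)
    print(info)
    if info["cuda"] in (None, "ERR!"):
        print("CUDA driver version could not be determined inside this container")
```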

This is my `docker-compose.yml`:

```yaml
version: "3"
services:
  handson-ml3:
    build:
      context: ../
      dockerfile: ./docker/Dockerfile.gpu #Dockerfile
      args:
        - username=devel
        - userid=1000
    container_name: handson-ml3
    image: ageron/handson-ml3:latest-gpu #latest
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: 50m
    ports:
      - "8888:8888"
      - "8890:8890"
      - "6006:6006"
    volumes:
      - ../:/home/devel/handson-ml3
    command: /opt/conda/envs/homl3/bin/jupyter-lab --ip=0.0.0.0 --port=8890 --no-browser
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu, utility]
```
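One hypothesis worth checking, not confirmed in this report, is the `capabilities: [gpu, utility]` entry. My understanding of Docker's NVIDIA integration is that any listed capability other than `gpu` that is also an NVIDIA driver capability gets exported as `NVIDIA_DRIVER_CAPABILITIES`, overriding the image's default. `utility` alone provides `nvidia-smi` but not the CUDA driver library (that comes with `compute`), which would fit both the `libcuda.so` NOT_FOUND message and `CUDA Version: ERR!`. `docker run --gpus all` requests only `gpu`, so the image default would still apply there. The Python sketch below models that rule; `effective_driver_caps`, the capability set, and the `compute,utility` image default are assumptions for illustration, not Docker's actual code:

```python
import os
from typing import Iterable, Set

# NVIDIA driver capabilities recognised by the NVIDIA container toolkit.
# Assumption: this set matches the toolkit documentation; check your version.
NVIDIA_DRIVER_CAPS = {"compute", "compat32", "graphics", "utility", "video", "display"}


def effective_driver_caps(requested: Iterable[str], image_default: str = "compute,utility") -> Set[str]:
    """Simplified model (an assumption) of how a device request's capabilities
    become NVIDIA_DRIVER_CAPABILITIES inside the container."""
    explicit = {cap for cap in requested if cap in NVIDIA_DRIVER_CAPS}
    if explicit:
        # Explicit driver capabilities replace the image's default list
        return explicit
    # Only "gpu" requested: the image's own NVIDIA_DRIVER_CAPABILITIES applies
    # (image_default is a hypothetical value; many CUDA base images use "compute,utility")
    return {cap for cap in image_default.split(",") if cap}


if __name__ == "__main__":
    compose_caps = effective_driver_caps(["gpu", "utility"])  # capabilities: [gpu, utility]
    run_caps = effective_driver_caps(["gpu"])                 # docker run --gpus all
    print("compose:", sorted(compose_caps), "has compute:", "compute" in compose_caps)
    print("run:    ", sorted(run_caps), "has compute:", "compute" in run_caps)
    # Inside a running container, the value actually applied can be read directly
    print("NVIDIA_DRIVER_CAPABILITIES =", os.environ.get("NVIDIA_DRIVER_CAPABILITIES"))
```

Under these assumptions the compose request yields only `utility`. If the last line printed inside the `docker-compose` container confirms that, changing the entry to `capabilities: [gpu]` or `capabilities: [gpu, compute, utility]` would be the next thing to try.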

Posted by azsrz on Apr 30 '23 20:04