h2o-llmstudio [BUG] Docker Image CUDA ERROR

🐛 Bug

I am getting the warning below, and the nightly Docker image doesn't see my GPU. I have an RTX 3090 with Driver Version: 470.182.03, CUDA Version: 11.4 on the host machine.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)

The Docker image's CUDA version seems to be 11.8, and my driver version should support it.

To Reproduce

sudo docker run --runtime=nvidia --shm-size=64g --init --rm -p 10101:10101 -v `pwd`/data:/workspace/data -v `pwd`/output:/workspace/output gcr.io/vorvan/h2oai/h2o-llmstudio:nightly

May 10 '23 13:05 aerdem4
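
To see what PyTorch itself reports inside the container, a short diagnostic along these lines may help. This is only a minimal sketch and not part of H2O LLM Studio: the helper name `cuda_report` is made up for illustration, and it relies on nothing beyond `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import warnings


def cuda_report():
    """Collect what PyTorch reports about CUDA in the current environment.

    Hypothetical helper for illustration; not part of H2O LLM Studio.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    # Record any warnings emitted while probing the GPU instead of
    # letting them scroll past on stderr.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()
        count = torch.cuda.device_count()

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "device_count": count,
        "warnings": [str(w.message) for w in caught],
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running it once on the host and once inside the container makes it easier to tell whether the problem is specific to the image.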

Thank you for reporting, @aerdem4

I am receiving the same error on my machine with a 3090. The host reports Driver Version: 510.108.03, CUDA Version: 11.6.

As everything runs smoothly on all other tested machines, I expected this to be a rare issue. It seems I was wrong. I'll investigate more and try to find a solution. Do you have any special ENV vars set on your host machine regarding CUDA? That is one thing that I have set differently on the machine where Docker can't initialize the GPU.

May 10 '23 19:05 pascal-pfeiffer

> Do you have any special ENV vars set on your host machine regarding cuda?

I don't think so. Maybe 11.8 is just not compatible with the 3090? Were any of the successful tests on a 3090?

May 10 '23 21:05 aerdem4

> I don't think so. Maybe 11.8 is just not compatible with 3090? Was any of successful tests on 3090?

None that I am aware of. The other tests covered A100, A10G, A6000, and V100 (all successful).

I just tested a docker build with nvidia/cuda:11.6.2-devel-ubuntu20.04 and that seems to work. Could you maybe test it on your machine, too? I'll merge the fix/downgrade if you confirm (https://github.com/h2oai/h2o-llmstudio/pull/105).

May 11 '23 12:05 pascal-pfeiffer

It didn't work for me. Same error.

May 11 '23 17:05 aerdem4

When I change the base image to match my host machine's CUDA version, it works.

May 11 '23 18:05 aerdem4
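
The reports so far fit a simple pattern: the container works when the image's CUDA version is no newer than the "CUDA Version" that `nvidia-smi` prints on the host. The sketch below checks that pattern from the `nvidia-smi` header line. It is an illustration only: the function names are made up, comparing on major.minor is an assumption, and the check is a heuristic drawn from this thread, not a confirmed root cause.

```python
import re

# Matches the header line printed by nvidia-smi, e.g.
# "Driver Version: 470.182.03 CUDA Version: 11.4"
_HEADER = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    match = _HEADER.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def major_minor(version):
    """Turn '11.6.2' into (11, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:2])


def image_exceeds_driver(smi_text, image_cuda):
    """True if the image's CUDA version is newer than the driver-reported one."""
    parsed = parse_smi_header(smi_text)
    if parsed is None:
        raise ValueError("could not find driver/CUDA versions in nvidia-smi output")
    _, driver_cuda = parsed
    return major_minor(image_cuda) > major_minor(driver_cuda)


if __name__ == "__main__":
    # Host values as reported earlier in this thread.
    hosts = {
        "aerdem4": "Driver Version: 470.182.03 CUDA Version: 11.4",
        "pascal-pfeiffer": "Driver Version: 510.108.03 CUDA Version: 11.6",
    }
    for image_cuda in ("11.8", "11.6.2"):
        for user, header in hosts.items():
            newer = image_exceeds_driver(header, image_cuda)
            print(f"image {image_cuda} vs {user}: image newer than driver = {newer}")
```

For the values above, only the 11.6.2 image on the 510.108.03 host comes out as not newer than the driver, which is also the only combination reported as working at this point.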

Yeah, unfortunately bitsandbytes tries to use the system-wide CUDA installation instead of the CUDA runtime bundled with PyTorch, which has caused all sorts of issues. This one might be related to it.

I hope that this PR fixes it: https://github.com/TimDettmers/bitsandbytes/pull/375

For now I am hesitant to try to address this too much, as it seems this is also only an issue on Docker for some setups. If manually fixing the cuda version fixes it for you, it sounds like a good workaround.

Otherwise, it might also be a good idea to run it outside of Docker with the make commands.

May 12 '23 10:05 psinger
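
To compare the system-wide CUDA runtime with the one PyTorch was built for, something like the following could be run inside the container. It is a rough sketch under stated assumptions: the helper names are hypothetical, and the search covers only `LD_LIBRARY_PATH` and `/usr/local/cuda/lib64`, which is not necessarily the full set of locations bitsandbytes inspects.

```python
import os
from pathlib import Path


def find_cudart(search_dirs):
    """Return sorted paths of libcudart.so* files found in the given directories."""
    found = set()
    for directory in search_dirs:
        path = Path(directory)
        if path.is_dir():
            found.update(str(p) for p in path.glob("libcudart.so*"))
    return sorted(found)


def default_search_dirs():
    """Directories from LD_LIBRARY_PATH plus /usr/local/cuda/lib64 (assumed)."""
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [d for d in raw.split(os.pathsep) if d]
    dirs.append("/usr/local/cuda/lib64")
    return dirs


if __name__ == "__main__":
    print("system libcudart candidates:", find_cudart(default_search_dirs()))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed here")
    else:
        print("PyTorch built for CUDA:", torch.version.cuda)
```

If the system libraries and the PyTorch build disagree on the CUDA version, that mismatch is a reasonable first thing to look at.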

It seems to be working for me on a 3090 with Docker, but I seem to have different driver and CUDA versions.

Docker image used: gcr.io/vorvan/h2oai/h2o-llmstudio:nightly (appears to be created on May 21, 2023, 6:12:32 AM)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
INFO:     127.0.0.1:33352 - "POST / HTTP/1.1" 200 OK
2023-05-21 09:17:29,190 - INFO: Initializing app ...
2023-05-21 09:17:29,201 - INFO: Initializing app ... done
2023-05-21 09:17:29,201 - INFO: Initializing client None
2023-05-21 09:17:29,239 - INFO: User name: anon
2023-05-21 09:17:29,242 - INFO: Downloading default dataset...

nvidia-smi (run from inside the container)

Sun May 21 11:40:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:00:10.0 Off |                  N/A |
|  0%   48C    P8               10W / 350W|      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

May 21 '23 11:05 krzysztofantczak

We updated bitsandbytes to 0.41.0, which should solve this. From the bitsandbytes release notes:

> Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We trust PyTorch to find the proper CUDA binaries and use those.

Please reopen if issues still persist.

Aug 18 '23 08:08 psinger
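
To confirm that an environment already has a bitsandbytes release with the reworked setup routine, the installed version can be compared against 0.41.0. The snippet below is a small sketch using only the standard library. The helper names are made up for illustration, and the version parsing is deliberately simple (it keeps leading digits of each dotted component).

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Parse leading numeric components: '0.41.0' -> (0, 41, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'post2' or 'dev0'
        parts.append(int(digits))
    # Pad so that '0.41' compares equal to '0.41.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_cuda_setup_overhaul(installed, minimum=(0, 41, 0)):
    """True if the version string is at least the given minimum."""
    return version_tuple(installed) >= minimum


if __name__ == "__main__":
    try:
        installed = version("bitsandbytes")
    except PackageNotFoundError:
        print("bitsandbytes is not installed")
    else:
        print(f"bitsandbytes {installed}: >= 0.41.0 is {has_cuda_setup_overhaul(installed)}")
```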
