
CUDA driver error inside Docker

brolinA opened this issue 1 year ago · 5 comments

Hi, I am trying to run the repository in Docker on Ubuntu 20.04.

The Docker setup was successful and I was able to run the Jackal robot as expected. Then I tried to run wild_visual_navigation_ros inside Docker and got the following error.

Command used: `roslaunch wild_visual_navigation_ros wild_visual_navigation.launch`

Error: (screenshot: CUDA error)

This is the native CUDA driver on my Ubuntu 20.04:

(screenshot: host nvidia-smi output)

Should the CUDA version match between the Docker image and my native system? Or is there something else causing the error?

brolinA avatar Jun 14 '24 10:06 brolinA

Hi @brolinA, thanks for reporting this. Can you check if you can:

  • run `nvidia-smi` inside the container?
  • run `python3 -c "import torch; print(torch.cuda.is_available())"` inside the container?
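The two checks above can also be wrapped in one small script (a hypothetical helper, not part of the repo) that reports both results and degrades gracefully when `nvidia-smi` or `torch` is missing:

```python
import subprocess


def gpu_checks():
    """Run both GPU diagnostics and return their raw outputs as strings."""
    results = {}
    try:
        # Same check as the first bullet: ask the driver what it sees.
        results["nvidia-smi"] = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True
        ).stdout
    except FileNotFoundError as e:
        results["nvidia-smi"] = f"not found: {e}"
    try:
        # Same check as the second bullet: ask PyTorch whether CUDA works.
        import torch
        results["torch.cuda.is_available"] = str(torch.cuda.is_available())
    except ImportError as e:
        results["torch.cuda.is_available"] = f"torch not installed: {e}"
    return results


if __name__ == "__main__":
    for name, out in gpu_checks().items():
        print(f"--- {name} ---\n{out}")
```

Running it both on the host and inside the container makes a driver mismatch easy to spot.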

I share your concern that there could be some incompatibility or driver issue.

mmattamala avatar Jun 16 '24 15:06 mmattamala

Hi @mmattamala, thank you for the response. Unfortunately, I am not able to start the Docker container after restarting my system. I keep getting the following error:

```
[+] Running 0/0
 ⠋ Container docker-wvn_nvidia-1  Recreate    0.0s
Error response from daemon: unknown or invalid runtime name: nvidia
```

I have made sure that both nvidia-docker2 and nvidia-container-toolkit have been installed, but it still doesn't work. It was working before, but stopped after restarting the PC.

Any idea how to tackle this issue?

brolinA avatar Jun 18 '24 12:06 brolinA

I think something is messed up with the NVIDIA Docker configuration. Can you check that all the steps are correct? https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker

Similarly, this thread might have some tips: https://stackoverflow.com/questions/52865988/nvidia-docker-unknown-runtime-specified-nvidia
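For reference, after installing the toolkit the runtime registration usually comes down to running `sudo nvidia-ctk runtime configure --runtime=docker` followed by `sudo systemctl restart docker`, which should leave an entry like this in `/etc/docker/daemon.json` (a sketch; your file may contain additional keys):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

If that `runtimes.nvidia` entry is missing, the daemon reports exactly the "unknown or invalid runtime name: nvidia" error shown above.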

mmattamala avatar Jun 18 '24 12:06 mmattamala

Hi @mmattamala, thank you. I was able to fix it.

Here is the output of the commands you mentioned:

(screenshot: nvidia-smi and torch.cuda.is_available() output inside the container)

brolinA avatar Jun 18 '24 12:06 brolinA

Good to know it helped.

Coming back to the original issue, it seems to be a mismatch between the driver on the host system and in the container (the nvidia-smi outputs don't match).

I'm a bit short on time at the moment to take a deeper look, but I recommend searching for similar issues with Docker.
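As a tiny illustration of the kind of mismatch to look for (a hypothetical helper, not project code): compare the major.minor CUDA version reported by `nvidia-smi` on the host with the one reported inside the container.

```python
def cuda_versions_match(host: str, container: str) -> bool:
    """Compare major.minor CUDA versions, e.g. "12.2" vs. "12.3"."""
    return host.split(".")[:2] == container.split(".")[:2]


# The mismatch reported later in this thread: host 12.2 vs. container 12.3.
print(cuda_versions_match("12.2", "12.3"))  # -> False
```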

mmattamala avatar Jun 21 '24 09:06 mmattamala

Adding on to this -- I had the exact same issue (same error messages). Running `nvidia-smi` in the container showed CUDA 12.3, whereas outside the container it showed CUDA 12.2. Changing the Dockerfile to use `FROM nvidia/cuda:12.2.2-runtime-ubuntu20.04 as base` fixed the issue for me.
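In Dockerfile form, the fix amounts to pinning the base image (a sketch; match the tag to the CUDA version your host's `nvidia-smi` reports):

```dockerfile
# Pin the base image to the host driver's CUDA version (12.2 in this case).
FROM nvidia/cuda:12.2.2-runtime-ubuntu20.04 as base
# ... rest of the original Dockerfile unchanged ...
```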

andreschreiber avatar Aug 24 '24 15:08 andreschreiber

Thanks @andreschreiber for the proposed fix! I'll close the issue.

mmattamala avatar Sep 06 '24 09:09 mmattamala