CUDA driver error inside Docker
Hi, I am trying to run the repository in Docker on Ubuntu 20.04.
The Docker setup was successful and I was able to run the Jackal robot as expected. Then I tried to run wild_visual_navigation_ros inside the container and I get the following error.
Command used
roslaunch wild_visual_navigation_ros wild_visual_navigation.launch
Error
This is the native CUDA driver on my Ubuntu 20.04 system.
Should the CUDA version match between the Docker image and my native system? Or is there something else causing the error?
Hi @brolinA, thanks for reporting this. Can you check if you can:
- run `nvidia-smi` inside the container?
- run `python3 -c "import torch; print(torch.cuda.is_available())"` inside the container?
I share your concern that there could be some incompatibility or driver issue.
Hi @mmattamala, thank you for the response. Unfortunately, I am not able to enter the Docker container after restarting my system. I keep getting the following error.
[+] Running 0/0
⠋ Container docker-wvn_nvidia-1 Recreate 0.0s
Error response from daemon: unknown or invalid runtime name: nvidia
I have made sure that both nvidia-docker2 and nvidia-container-toolkit are installed, but it still doesn't work. It was working before, but stopped working after restarting the PC.
Any idea how to tackle this issue?
I think something is messed up with the NVIDIA Docker configuration. Can you check that all the steps are correct? https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker
Similarly, this thread might have some tips: https://stackoverflow.com/questions/52865988/nvidia-docker-unknown-runtime-specified-nvidia
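For reference, the "unknown or invalid runtime name: nvidia" error usually means the nvidia runtime was never registered (or was lost) in Docker's daemon configuration. A minimal sketch of what `/etc/docker/daemon.json` typically needs, assuming the NVIDIA Container Toolkit is installed at the default path:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

After editing the file, restart the daemon with `sudo systemctl restart docker` so the runtime is picked up.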
Hi @mmattamala, Thank you. I was able to fix it.
Here is the output of the commands you mentioned.
Good to know it helped.
Coming back to the original issue, it seems to be a mismatch between the driver in the host system and the one in the container (because the nvidia-smi outputs don't match).
I'm a bit short on time at the moment to take a deeper look, but I recommend searching for similar Docker issues.
Adding on to this -- I had the exact same issue (same error messages).
Running `nvidia-smi` in the container showed CUDA 12.3, whereas outside the container it showed CUDA 12.2.
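For anyone checking the same thing, the comparison between the two environments can be scripted by parsing the `CUDA Version` field from the `nvidia-smi` header. A minimal sketch — the sample header lines below are made up for illustration, not copied from this issue:

```python
import re

def cuda_version(smi_header: str) -> tuple:
    """Extract the 'CUDA Version: X.Y' field from an nvidia-smi header line."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if m is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return (int(m.group(1)), int(m.group(2)))

# Hypothetical header lines: capture the real ones by running `nvidia-smi`
# on the host and inside the container, then compare.
host = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
container = "| NVIDIA-SMI 545.23.06   Driver Version: 545.23.06   CUDA Version: 12.3 |"

host_v, cont_v = cuda_version(host), cuda_version(container)
if cont_v > host_v:
    # The container's CUDA runtime is newer than what the host driver supports,
    # which matches the mismatch described in this thread.
    print(f"container CUDA {cont_v} is newer than the host driver's {host_v}")
```

If the container reports a newer CUDA version than the host, pinning the base image to the host's version (as below) is one way to resolve it.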
Changing the Dockerfile to use:
FROM nvidia/cuda:12.2.2-runtime-ubuntu20.04 as base
fixed the issue for me.
Thanks @andreschreiber for the proposed fix! I'll close the issue