Can't train the model on the GPU on a server with an RTX 3090
I first ran the code with its default config on my server, but later I noticed that training was actually running on the CPU, and nvidia-smi returned an error.
After that, I found on Docker Hub that I can use the GPU inside the container by passing --gpus all to docker run, that is to say, replacing
docker run --rm -m4g -v /path/to/data:/mnt/data -it ratsql
with
docker run --rm --gpus all -m4g -v /path/to/data:/mnt/data -it ratsql
I then found that nvidia-smi works inside the container, but when I trained the model, it failed with an error like
"the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:331"
I searched the internet, and it is said that CUDA 11+ is necessary for RTX 30XX GPUs. I then modified the Dockerfile base image to
pytorch/pytorch:1.5-cuda10.1-cudnn7-devel
and rebuilt the image, but the same error occurred.
I wonder whether I can train the model on the GPU in Docker at all. Kindly help me resolve this issue; any help would be really appreciated.
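For reference, the change I made is only to the base-image line of the Dockerfile (I assume the FROM line is the relevant part; everything else is unchanged):

```dockerfile
# Base image line of my Dockerfile after the edit; the rest of the file
# is the repository's original content.
FROM pytorch/pytorch:1.5-cuda10.1-cudnn7-devel
```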
The way I understand your issue, the first thing to check is whether PyTorch recognises your CUDA device. Try this in a terminal or any console: python3 -c "import torch; assert(torch.cuda.is_available())". What is the output?
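To go a bit further than the one-liner, here is a small diagnostic script (nothing in it is specific to RAT-SQL) that reports which CUDA toolkit your PyTorch build was compiled against and whether it can see the card:

```python
import torch

# Report the PyTorch build and the CUDA toolkit it was compiled against.
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)      # None means a CPU-only build
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # The RTX 3090 is compute capability 8.6; the build must support 'sm_86'.
    print("Supported archs:", torch.cuda.get_arch_list())
```

If "Built with CUDA" prints 10.1, that build cannot generate code for the 3090's sm_86 architecture, which matches the THCBlas error you saw; you would need a CUDA 11 base image (for example pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel — check the available tags on Docker Hub, as this is just one I believe exists) rather than the cuda10.1 one.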