rat-sql

Can't train the model with GPU on a server with an RTX 3090

Open: Quasimoodo opened this issue 4 years ago • 1 comment

I first ran the code with its default config on my server, but later noticed that training was actually running on the CPU, and nvidia-smi returned an error. I then found on Docker Hub that the GPU can be used inside a container by passing --gpus all to docker run, that is, by replacing

docker run --rm -m4g -v /path/to/data:/mnt/data -it ratsql

with

docker run --rm --gpus all -m4g -v /path/to/data:/mnt/data -it ratsql

With that change nvidia-smi works in the container, but training the model fails with an error like "the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:331". From what I found online, CUDA 11+ is required for RTX 30XX GPUs. I then changed the base image in the Dockerfile to pytorch/pytorch:1.5-cuda10.1-cudnn7-devel and rebuilt the image, but the same error occurred again.

Can the model be trained on the GPU inside Docker? Any help in resolving this issue would be much appreciated.

Quasimoodo, Aug 08 '21 16:08

As I understand your issue, the first thing to check is whether PyTorch recognises your CUDA device. Try this in a terminal or any console: python3 -c "import torch; assert(torch.cuda.is_available())". What is the output?
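
If the one-liner passes but training still fails, a slightly fuller check may help narrow things down. The sketch below is only an illustration, not part of rat-sql: the function name cuda_report is made up here, and it assumes nothing beyond a Python interpreter with PyTorch possibly installed. It prints the CUDA version PyTorch was built against next to the GPU's compute capability, so a mismatch (for example a CUDA 10.x build on a card that needs CUDA 11+) becomes visible.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup (hypothetical helper)."""
    # Return early instead of raising if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA toolkit version PyTorch was compiled with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of the first visible GPU.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        # get_arch_list is missing in older PyTorch releases, so look it up defensively.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        if get_arch_list is not None:
            report["compiled_archs"] = get_arch_list()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it inside the container started with --gpus all. An RTX 3090 should report a compute capability of (8, 6); if built_with_cuda shows a 10.x version, that would be consistent with the CUDA 11+ requirement mentioned in the question.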

m1nhtu99-hoan9, Aug 27 '21 09:08