
Not using GPU in training

Open Zach3292 opened this issue 1 year ago • 12 comments

As mentioned in the title, the program doesn't seem to use my NVIDIA GPU (3050 Ti) to train. Instead, CPU usage jumps to 100%.

[Screenshot: TMRL result]

Zach3292 avatar Jul 22 '22 14:07 Zach3292

Hi, have you set the trainer to cuda in config.json?

yannbouteiller avatar Jul 22 '22 16:07 yannbouteiller

CUDA_TRAINING is set to true but CUDA_INFERENCE is set to false

I didn't change the config file, it is still default
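
For anyone wanting to double-check these two flags without opening the file by hand, here is a minimal sketch. The file location (`~/TmrlData/config/config.json`) is an assumption, and because the exact nesting of the keys is not shown in this thread, the helper searches the whole JSON tree for `CUDA_TRAINING` and `CUDA_INFERENCE`. The function names are hypothetical and not part of tmrl.

```python
import json
from pathlib import Path


def find_flags(node, names):
    """Recursively collect the first value found for each key in `names`."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in names:
                found.setdefault(key, value)
            else:
                for k, v in find_flags(value, names).items():
                    found.setdefault(k, v)
    elif isinstance(node, list):
        for item in node:
            for k, v in find_flags(item, names).items():
                found.setdefault(k, v)
    return found


def read_cuda_flags(config_path):
    """Load a JSON config file and return the CUDA flags it contains."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return find_flags(config, {"CUDA_TRAINING", "CUDA_INFERENCE"})


if __name__ == "__main__":
    # Assumed location; adjust to wherever your TmrlData folder lives.
    path = Path.home() / "TmrlData" / "config" / "config.json"
    if path.exists():
        print(read_cuda_flags(path))
    else:
        print(f"No config file found at {path}")
```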

Zach3292 avatar Jul 22 '22 18:07 Zach3292

Strange, the trainer terminal should be using your GPU when running training steps then. Can you try to open another terminal and run nvidia-smi while training steps are being performed?

yannbouteiller avatar Jul 22 '22 20:07 yannbouteiller

So this is what I got from nvidia-smi while running the training; there is still no GPU usage in Task Manager. [Screenshot 2022-07-22 172029]

Zach3292 avatar Jul 22 '22 21:07 Zach3292

Nvidia-smi says that 50% of your GPU memory is used, but I am not sure whether this is from Trackmania or from the trainer terminal. What happens if you close the worker terminal and the game and execute nvidia-smi while the trainer terminal is still performing training steps?
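
One way to tell which process is holding GPU memory without closing anything is to ask nvidia-smi for its per-process compute list. The sketch below is only an illustration: the helper names are hypothetical, and on Windows the `used_memory` column may come back as `[N/A]`, in which case the list still shows whether the trainer's Python process appears as a compute process at all.

```python
import shutil
import subprocess

# Per-process query supported by the nvidia-smi CLI.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn 'pid, name, memory' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed lines
        pid, rest = line.split(",", 1)
        # Split from the right so commas inside a process path survive.
        name, used = rest.rsplit(",", 1)
        rows.append(
            {"pid": pid.strip(), "name": name.strip(), "used_memory": used.strip()}
        )
    return rows


def list_gpu_processes():
    """Return the compute processes seen by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    print(list_gpu_processes())
```

If the trainer's Python process never shows up in that list while training steps are running, that would suggest the training is happening on the CPU.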

yannbouteiller avatar Jul 22 '22 21:07 yannbouteiller

So when closing the game and the worker, the trainer still uses the CPU at 95%+, but the VRAM usage in nvidia-smi dropped to 1%, so I believe it's only the game using it. [Screenshot]

Zach3292 avatar Jul 22 '22 21:07 Zach3292

So weird. I would expect PyTorch to throw an error if it cannot use CUDA for any reason when CUDA_TRAINING is true.
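
A quick way to see what PyTorch itself reports, independently of tmrl, is sketched below. It only uses standard PyTorch calls (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name`); the `cuda_report` helper is a hypothetical name, and it returns early if PyTorch is not installed in the current environment.

```python
import importlib.util


def cuda_report():
    """Summarize what the local PyTorch install says about CUDA."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # None here means a CPU-only build of PyTorch.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the installed PyTorch cannot run on the GPU, whatever the config file says.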

yannbouteiller avatar Jul 22 '22 21:07 yannbouteiller

This is when I only use the laptop. I'll try later with my main computer as the server and trainer to see if something similar happens.

Zach3292 avatar Jul 22 '22 21:07 Zach3292

I found this https://discuss.pytorch.org/t/nvidia-geforce-rtx-3050-ti-laptop-gpu-with-cuda-capability-sm-86-is-not-compatible-with-the-current-pytorch-installation/143837

Even though I don't really understand everything in it, I thought it might give you a clue as to what the problem may be

Zach3292 avatar Jul 22 '22 22:07 Zach3292

Yes, that is the setting we use for real training. I have never tried CUDA-enabled training locally on my laptop because I don't even have a CUDA-enabled version of PyTorch on it; I just use it to run the worker. Still, it sounds weird that the worker doesn't saturate your laptop GPU. Perhaps the CPU is a huge bottleneck in your setting, IDK.

yannbouteiller avatar Jul 22 '22 22:07 yannbouteiller

> I found this https://discuss.pytorch.org/t/nvidia-geforce-rtx-3050-ti-laptop-gpu-with-cuda-capability-sm-86-is-not-compatible-with-the-current-pytorch-installation/143837
>
> Even though I don't really understand everything in it, I thought it might give you a clue as to what the problem may be

Yup, sounds relevant, perhaps your pytorch installation is not compatible with your CUDA version (11.7 according to nvidia-smi)?
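
To test that hypothesis, one could compare the GPU's compute capability with the architectures the installed PyTorch binary was built for. The sketch below is an approximation under stated assumptions: `arch_supported` and `check_local_gpu` are hypothetical helper names, the check only looks at `sm_XY` entries (it ignores PTX entries such as `compute_80`), and it treats a binary built for the same major version with an equal or lower minor version as usable. Note also that the CUDA version printed by nvidia-smi is the highest version the driver supports, which is not necessarily the version PyTorch was built with.

```python
import importlib.util


def arch_supported(capability, arch_list):
    """Return True if some 'sm_XY' entry can run on a device of `capability`."""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as "compute_80"
        digits = arch[3:]
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip suffixed variants such as "sm_90a"
        # Same major version and an equal or lower minor version.
        if int(digits[:-1]) == major and int(digits[-1]) <= minor:
            return True
    return False


def check_local_gpu():
    """Run the check against the local GPU, or return None if not possible."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for sm_86
    return arch_supported(capability, torch.cuda.get_arch_list())


if __name__ == "__main__":
    print(check_local_gpu())
```

For example, with a made-up architecture list, `arch_supported((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"])` returns `False`, which is the kind of mismatch the linked thread title describes.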

yannbouteiller avatar Jul 22 '22 22:07 yannbouteiller

Hi, were you able to solve or locate the issue?

yannbouteiller avatar Aug 09 '22 15:08 yannbouteiller

Closing for inactivity, as I cannot reproduce the issue. Please reopen if you experience something similar.

yannbouteiller avatar Sep 12 '22 15:09 yannbouteiller