tmrl
Not using GPU in training
As mentioned in the title, the program doesn't seem to use my NVIDIA GPU (3050 Ti) for training. Instead, CPU usage jumps to 100%.
Hi, have you set the trainer to cuda in config.json?
CUDA_TRAINING is set to true but CUDA_INFERENCE is set to false
I didn't change the config file; it is still the default.
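For reference, the relevant part of config.json looks something like this (surrounding keys omitted; the two field names are from this thread, the default values are as stated above):

```json
{
  "CUDA_TRAINING": true,
  "CUDA_INFERENCE": false
}
```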
Strange, the trainer terminal should be using your GPU when running training steps then. Can you try to open another terminal and run nvidia-smi while training steps are being performed?
So this is what I got from nvidia-smi while running the training; still no GPU usage in Task Manager.
Nvidia-smi says that 50% of your GPU memory is used, but I am not sure whether this is from Trackmania or from the trainer terminal. What happens if you close the worker terminal and the game and execute nvidia-smi while the trainer terminal is still performing training steps?
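Rather than eyeballing Task Manager, nvidia-smi can attribute GPU memory per process, which would settle whether the VRAM belongs to the game or to the trainer. A sketch (guarded so it is a no-op on machines without the NVIDIA tools; note that on Windows the per-process memory column can show N/A depending on the driver mode):

```shell
# List per-process GPU memory for CUDA compute apps (e.g. the Python trainer).
# The game renders via graphics APIs, so it may only appear in plain nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
fi
```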
So when I close the game and the worker, the trainer still uses 95%+ CPU, but the VRAM usage in nvidia-smi drops to 1%, so I believe only the game was using the GPU.
So weird, I would expect PyTorch to throw an error if it cannot use CUDA for any reason when CUDA_TRAINING is true.
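A quick way to check what the trainer environment actually sees (a standalone sketch, not tmrl code; run it in the same Python environment the trainer uses):

```python
# Minimal CUDA diagnostic: reports whether the installed torch build can
# actually use the GPU, and gives a hint about why not if it can't.
def cuda_report():
    try:
        import torch
    except ImportError:
        return {"available": False, "reason": "torch not installed"}
    info = {
        "available": torch.cuda.is_available(),
        # None for CPU-only wheels, e.g. "11.7" for a cu117 build
        "compiled_cuda": torch.version.cuda,
    }
    if info["available"]:
        # capability (8, 6) == sm_86, the RTX 3050 Ti from this thread
        info["capability"] = torch.cuda.get_device_capability(0)
        info["device"] = torch.cuda.get_device_name(0)
    return info

if __name__ == "__main__":
    print(cuda_report())
```

If this prints `"available": False` with `"compiled_cuda": None`, the installed wheel is CPU-only and training silently falls back to the CPU instead of erroring out.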
This is when I only use the laptop; I'll try later with my main computer as the server and trainer to see if something similar happens.
I found this https://discuss.pytorch.org/t/nvidia-geforce-rtx-3050-ti-laptop-gpu-with-cuda-capability-sm-86-is-not-compatible-with-the-current-pytorch-installation/143837
Even though I don't really understand everything in it, I thought it might give you a clue as to what the problem may be
Yes, that is the setting we use for real training. I have never tried CUDA-enabled training locally on my laptop, because I don't even have a CUDA-enabled version of PyTorch there; I just use the laptop to run the worker. Still, it sounds weird that the worker doesn't saturate your laptop GPU; perhaps the CPU is a huge bottleneck in your setup, IDK.
Yup, that sounds relevant; perhaps your PyTorch installation is not compatible with your CUDA version (11.7 according to nvidia-smi)?
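If that's the cause, reinstalling torch from the wheel index matching the driver's CUDA version should fix it. A sketch (the version-to-suffix mapping is PyTorch's standard cuXYZ naming; the echo just shows the command instead of running it):

```shell
# Derive the PyTorch wheel suffix from the CUDA version nvidia-smi reports
# (11.7 -> cu117) and print the matching reinstall command.
CUDA_VER="11.7"
SUFFIX="cu$(echo "$CUDA_VER" | tr -d .)"
echo "pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/${SUFFIX}"
```

cu117 wheels also ship sm_86 kernels, so this would resolve the "not compatible with the current PyTorch installation" warning from the linked thread.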
Hi, could you solve/locate the issue?
Closing for inactivity as I cannot reproduce the issue, please reopen if you experience something similar