GaitMixer
Program interrupts during multi-GPU training
Hi, this is great work! But I need some help. When I run train.py with multiple GPUs (for example, with the "--gpus" parameter set to "0,1,2,3,4,5,6,7"), the program stops without returning any error. I found that the interruption occurs at the "loss.backward()" line of code. Can you give me some advice? Thank you very much!!
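For reference, the failing step is the standard backward call in a DataParallel training loop. A minimal sketch follows (the real model and loss in train.py differ; this only marks where the hang happens, and it assumes 8 visible GPUs):

```python
import torch
import torch.nn as nn

# Minimal sketch of the failing step, not the actual train.py code.
model = nn.Linear(128, 64).cuda()  # stand-in for the actual network
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(256, 128).cuda()
target = torch.randn(256, 64).cuda()

optimizer.zero_grad()
out = model(x)                      # forward pass is replicated across the GPUs
loss = nn.functional.mse_loss(out, target)
loss.backward()                     # <- the program hangs here on 8 GPUs
optimizer.step()
```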
It may be something with the GPU environment. Have you tried with only 1 GPU and with 2 GPUs? (export CUDA_VISIBLE_DEVICES=0)
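For the 2-GPU check, the same mechanism applies (assuming the script simply uses whatever devices are visible):

```bash
export CUDA_VISIBLE_DEVICES=0,1
```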
Yes, there is no problem when I use just one GPU. I have set os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", but it still doesn't work.
Are you using a cluster or multiprocessing? The code uses DataParallel, so it doesn't support multiprocessing.
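To illustrate the distinction: DataParallel runs in a single process, with one interpreter driving every visible GPU and each batch split across them. A minimal sketch (the model is a stand-in, not the repo's actual class):

```python
import torch
import torch.nn as nn

# DataParallel needs no torch.distributed or torch.multiprocessing setup;
# DistributedDataParallel, by contrast, requires one process per GPU.
model = nn.Linear(128, 64).cuda()   # stand-in for the actual network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the module across visible GPUs
```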
Yes, I know this code uses DataParallel, and I don't use multiprocessing. As a comparison, I can use 8 GPUs with GaitGraph.
One difference from GaitGraph is that we use the Triplet loss from pytorch_metric_learning. But it shouldn't be a problem; it also works on my 4-GPU server.
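For reference, the pytorch_metric_learning triplet loss is used roughly like this (a sketch; the margin value, embedding shapes, and label range are illustrative assumptions, not the repo's actual settings):

```python
import torch
from pytorch_metric_learning import losses

loss_func = losses.TripletMarginLoss(margin=0.2)        # margin is illustrative

embeddings = torch.randn(32, 128, requires_grad=True)   # (batch, embedding_dim)
labels = torch.randint(0, 8, (32,))                     # e.g., subject IDs
loss = loss_func(embeddings, labels)                    # mines triplets internally
loss.backward()
```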
You can try --loss_func supcon to see whether or not the Triplet loss causes this problem.
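For example (a guessed invocation; only the "--gpus" and "--loss_func" flags are mentioned in this thread, and other arguments may be required):

```bash
python train.py --gpus 0,1,2,3,4,5,6,7 --loss_func supcon
```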
I changed the conda environment to the one used by GaitGraph and the problem was solved! I guess a certain package version was causing the problem. Thank you again for your kind answers!