
Program interrupts during multi-GPU training

Open hxi667 opened this issue 1 year ago • 7 comments

Hi, this is great work! But I need some help. When I run train.py with multiple GPUs (for example, with the "--gpus" parameter set to "0,1,2,3,4,5,6,7"), my program interrupts without returning any error. I found that the interrupt occurs at the "loss.backward()" line. Can you give me some advice? Thank you very much!

hxi667 avatar Mar 07 '23 01:03 hxi667

It may be something with the GPU environment. Have you tried with only 1 GPU and with 2 GPUs? (export CUDA_VISIBLE_DEVICES=0)

exitudio avatar Mar 07 '23 02:03 exitudio

Yes, there is no problem when I use a single GPU. I've also set os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", but it still doesn't work.

hxi667 avatar Mar 07 '23 02:03 hxi667
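As a side note (a general CUDA sketch, not something stated in the thread): CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA runtime is initialized, so an in-process assignment like the one above must come before torch is imported. A minimal sketch:

```python
import os

# Restrict this process to GPUs 0 and 1. This assignment must run
# before torch (or any library that initializes CUDA) is imported;
# once a CUDA context exists, changing the variable has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0,1

# import torch  # only import CUDA libraries after the line above
```

If the variable is instead exported in the shell before launching Python (as suggested earlier in the thread), the ordering concern disappears.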

Are you using clusters or multiprocessing? The code uses DataParallel, so it doesn't support multiprocessing.

exitudio avatar Mar 07 '23 02:03 exitudio

Yes, I know this code uses DataParallel, and I don't use multiprocessing. As a comparison, I can use 8 GPUs on GaitGraph.

hxi667 avatar Mar 07 '23 02:03 hxi667

One difference from GaitGraph is that we use Triplet loss from pytorch_metric_learning. But it shouldn't be a problem; it also works on my 4-GPU server.

exitudio avatar Mar 07 '23 02:03 exitudio

You can try --loss_func supcon to see whether or not the Triplet loss causes this problem.

exitudio avatar Mar 07 '23 02:03 exitudio
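The two suggestions in this thread can be combined into a quick bisection. A sketch of the commands, assuming train.py accepts the --gpus and --loss_func flags exactly as used above (the GPU count of 2 here is just an illustrative choice):

```shell
# Same reduced GPU count, alternate loss function.
# If this run completes while the Triplet-loss run hangs,
# the loss function is the likely suspect; if both hang,
# look at the multi-GPU environment instead.
export CUDA_VISIBLE_DEVICES=0,1
python train.py --gpus 0,1 --loss_func supcon
```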

I changed the conda environment to the one used by GaitGraph and the problem was solved! I guess a certain package version was causing the problem. Thank you again for your kind answers!

hxi667 avatar Mar 07 '23 03:03 hxi667