res-loglikelihood-regression icon indicating copy to clipboard operation
res-loglikelihood-regression copied to clipboard

Timed out initializing process group in store based barrier on rank: 2

Open yotaroshimose opened this issue 2 years ago • 2 comments

Hi, Thank you for sharing your great work!

I tried to run your training scripts. But my machine only has 4 GPUs. So I changed its WORLD SIZE to 4 from 8 in original yaml file.

Then it says "Timed out initializing process group in store based barrier on rank: 2" or sometimes it suddenly crashes during the epoch and my docker container shutdowns (indicating memory leak?).

Any advise on successfully running your training code?

Thank you for your cooperation.

yotaroshimose avatar Jul 04 '22 10:07 yotaroshimose

@yotaroshimose bro,my english is not very good ,so maybe i wrongly understand your question,the following is my advice you can add a statement -> CUDA_VISIBLE_DEVICES=0,1 (i have 4 GPU but only use the first and the second) before the statement -> python ./scripts/validate.py \
in the file which name is validate.sh or train.sh wish help you

WjzZwd avatar Sep 23 '22 12:09 WjzZwd

Thank you for your kind reply. I will try the way of your device control. thank you.

yotaroshimose avatar Sep 26 '22 01:09 yotaroshimose