nsff_pl

Multi-GPU training issue

Open: aha831 opened this issue 3 years ago • 1 comment

I changed the default parameters num_gpus to 8 and num_nodes to 8, kept all other defaults unchanged, and ran train.py on 8 V100 GPUs. Training always gets stuck at device initialization and never reaches the data-loading stage. Are any additional settings required? 🤔 [image]

aha831, Sep 24 '22 03:09

I tested other multi-GPU counts and they all get stuck in the same way. Training runs successfully only when gpus=1.

aha831, Sep 24 '22 03:09
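
One possible cause, sketched under assumptions: if train.py forwards num_gpus and num_nodes to a distributed (DDP-style) launcher, the job waits for num_gpus × num_nodes processes to join before initialization completes. With num_gpus=8 and num_nodes=8 that is 64 processes, but a single 8-GPU machine only starts 8, so initialization would block. The helper names below (expected_world_size, will_hang, machines_started) are hypothetical and only illustrate the arithmetic; they are not part of nsff_pl or any library.

```python
def expected_world_size(num_gpus: int, num_nodes: int) -> int:
    """Processes a distributed job waits for: GPUs per node times number of nodes."""
    return num_gpus * num_nodes


def launched_processes(num_gpus: int, machines_started: int) -> int:
    """Processes actually started: one per GPU on each machine where training was launched."""
    return num_gpus * machines_started


def will_hang(num_gpus: int, num_nodes: int, machines_started: int = 1) -> bool:
    """True if fewer processes are started than the job expects (assumed cause of the stall)."""
    return launched_processes(num_gpus, machines_started) < expected_world_size(num_gpus, num_nodes)


if __name__ == "__main__":
    # Settings from the report: 8 GPUs per node, 8 nodes, run on one machine.
    print(expected_world_size(8, 8))  # 64 processes expected
    print(will_hang(8, 8))            # True: only 8 of 64 processes exist
    # Single machine with 8 GPUs described as one node.
    print(will_hang(8, 1))            # False: 8 of 8 processes exist
```

If this assumption holds, a single machine with 8 V100s would be described with num_gpus=8 and num_nodes=1; num_nodes counts machines, not GPUs.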