
On Windows, multi-GPU training on my VOC dataset fails with an NCCL error

Open 1605707467qq opened this issue 2 years ago • 3 comments

Hello, the model trains normally when I use 1 GPU, but the following problem occurs when I try to train on 4 GPUs:


(yolox_train) D:\mengxianchi\YOLOX\YOLOX>python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 4 -b 32 --fp16 -o -c weights/yolox_s.pth
2022-06-22 14:58:49.750 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 1 initialization finished.
2022-06-22 14:58:49.844 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 3 initialization finished.
2022-06-22 14:58:50.089 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 0 initialization finished.
2022-06-22 14:58:50.170 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 2 initialization finished.
2022-06-22 14:58:50.183 | ERROR | yolox.core.launch:_distributed_worker:126 - Process group URL: tcp://127.0.0.1:49315
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    launch(
  File "d:\mengxianchi\yolox\yolox\yolox\core\launch.py", line 82, in launch
    mp.start_processes(
  File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
    while not context.join():
  File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
    fn(i, *args)
  File "d:\mengxianchi\yolox\yolox\yolox\core\launch.py", line 118, in _distributed_worker
    dist.init_process_group(
  File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

I have tried many of the fixes suggested online without success, so I would like to ask for your help.

1605707467qq avatar Jun 22 '22 08:06 1605707467qq

I could not find a working solution for multi-GPU training online.

1605707467qq avatar Jun 22 '22 08:06 1605707467qq

NCCL is not supported on Windows; please use gloo instead.
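
For illustration only, here is a minimal sketch of how a script might choose the backend string before calling `torch.distributed.init_process_group`. The helper name `pick_dist_backend` and its parameters are hypothetical and not part of YOLOX or PyTorch; the only PyTorch calls used are `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()`.

```python
import sys
from typing import Optional


def pick_dist_backend(platform: str = sys.platform,
                      nccl_available: Optional[bool] = None) -> str:
    """Hypothetical helper: return "nccl" only when it is usable, else "gloo"."""
    # PyTorch builds for Windows do not include NCCL, so always use gloo there.
    if platform.startswith("win"):
        return "gloo"
    if nccl_available is None:
        try:
            import torch.distributed as dist
            # is_nccl_available() is only consulted when distributed support exists.
            nccl_available = dist.is_available() and dist.is_nccl_available()
        except ImportError:
            # Without PyTorch installed we cannot confirm NCCL, so fall back.
            nccl_available = False
    return "nccl" if nccl_available else "gloo"


if __name__ == "__main__":
    # The result would be passed as backend=... to init_process_group.
    print(pick_dist_backend())
```

In YOLOX itself the backend can probably be switched without code changes: recent versions of `tools/train.py` appear to accept a `--dist-backend` option (defaulting to `nccl`), so appending `--dist-backend gloo` to the command above may be enough. Check `python tools/train.py --help` for your checkout before relying on that.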

FateScript avatar Jun 22 '22 08:06 FateScript

Thank you. I'll try solving the problem this way.

1605707467qq avatar Jun 22 '22 08:06 1605707467qq