YOLOX
On Windows, multi-GPU training on my VOC dataset fails with an NCCL error
Hello, the model trains normally when I use 1 GPU, but the following error occurred when I tried to switch to training on 4 GPUs:
(yolox_train) D:\mengxianchi\YOLOX\YOLOX>python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 4 -b 32 --fp16 -o -c weights/yolox_s.pth
2022-06-22 14:58:49.750 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 1 initialization finished.
2022-06-22 14:58:49.844 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 3 initialization finished.
2022-06-22 14:58:50.089 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 0 initialization finished.
2022-06-22 14:58:50.170 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 2 initialization finished.
2022-06-22 14:58:50.183 | ERROR | yolox.core.launch:_distributed_worker:126 - Process group URL: tcp://127.0.0.1:49315
Traceback (most recent call last):
File "tools/train.py", line 133, in
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
fn(i, *args)
File "d:\mengxianchi\yolox\yolox\yolox\core\launch.py", line 118, in _distributed_worker
dist.init_process_group(
File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
_update_default_pg(_new_process_group_helper(
File "C:\ProgramData\Anaconda3\envs\yolox_train\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
I tried many of the fixes suggested online, but none of them worked for multi-GPU training, so I would like to ask for help here.
NCCL is not supported on Windows; please use the gloo backend instead.
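For anyone landing here with the same error, below is a rough sketch of how the backend could be chosen per platform before calling `dist.init_process_group`. The helper names `choose_backend` and `init_distributed` are made up for illustration and are not part of YOLOX; only `torch.distributed.is_nccl_available()` and `dist.init_process_group(...)` are real PyTorch calls. Treat it as a starting point under those assumptions, not as the way YOLOX itself does it.

```python
import sys


def choose_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Pick a torch.distributed backend name (hypothetical helper).

    NCCL has no Windows build, so Windows always gets "gloo".
    Elsewhere, use "nccl" only if PyTorch reports it as available.
    """
    if platform.startswith("win"):
        return "gloo"
    return "nccl" if nccl_available else "gloo"


def init_distributed(rank: int, world_size: int, dist_url: str) -> str:
    """Initialise the process group with the chosen backend (hypothetical helper)."""
    # Imported lazily so choose_backend() stays usable without PyTorch installed.
    import torch.distributed as dist

    backend = choose_backend(nccl_available=dist.is_nccl_available())
    dist.init_process_group(
        backend=backend,
        init_method=dist_url,  # e.g. the tcp://127.0.0.1:<port> URL from the log
        world_size=world_size,
        rank=rank,
    )
    return backend
```

In practice it may be enough to pass the backend on the command line: check whether your copy of `tools/train.py` exposes a distributed-backend option (I am assuming it does, please verify with `python tools/train.py --help`) and set it to `gloo`. Note that gloo is generally slower than NCCL for GPU all-reduce, so expect lower multi-GPU scaling on Windows.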
Thank you. I'll try switching to gloo.