YOLOX

train error :

Open ladyxuxu opened this issue 2 years ago • 2 comments

Hi, I use the command:

python ${workspace}/train.py -f ${train_data_dir}/yolox_voc_s.py -d 4 -b 64 -c ${weights_data_dir}/yolox_s.pth

But an error occurs:

File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier

work = default_pg.barrier(opts=opts)

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3

ncclSystemError: System call (socket, malloc, munmap, etc) failed.

So, could you tell me how I can train YOLOX on multiple GPUs?

ladyxuxu, Jul 06 '22 02:07

Ubuntu 20.03, YOLOX 0.3.0, Python 3.6.9, PyTorch 1.10.1, CUDA 10.2, driver version 460.56

ladyxuxu, Jul 06 '22 02:07

  1. Check whether other code using DDP works on your machine.
  2. If it does not work, try another version of torch or check your hardware.
  3. If it works, try adding import yolox to your code and report back to us.
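
For step 1, a standalone check along these lines may help. It is only a minimal sketch, not part of YOLOX: the file name ddp_check.py and the helper names are made up for illustration, and it assumes the script is started with torchrun (or python -m torch.distributed.run), which sets RANK, LOCAL_RANK and WORLD_SIZE for each process. It calls the same barrier() that fails in the traceback above, followed by one all_reduce.

```python
import os


def choose_backend(cuda_available: bool) -> str:
    """NCCL requires GPUs; fall back to gloo on a CPU-only machine."""
    return "nccl" if cuda_available else "gloo"


def run_check() -> None:
    # Imported lazily so the file can be read or imported without torch.
    import torch
    import torch.distributed as dist

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # The launcher sets LOCAL_RANK; bind each process to its own GPU.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Rank, world size and master address are read from the environment.
    dist.init_process_group(backend=choose_backend(use_cuda))

    dist.barrier()  # the call that raised the NCCL error in the traceback

    value = torch.ones(1, device="cuda" if use_cuda else "cpu")
    dist.all_reduce(value)  # default op is SUM, so the result equals world size
    print(
        f"rank {dist.get_rank()}: sum = {value.item()}, "
        f"world size = {dist.get_world_size()}"
    )
    dist.destroy_process_group()


if __name__ == "__main__":
    if "RANK" in os.environ:
        run_check()
    else:
        # Hypothetical file name; adjust the process count to your GPU count.
        print("Launch with: torchrun --nproc_per_node=4 ddp_check.py")
```

If this small script fails with the same ncclSystemError, the problem is likely in the environment rather than in YOLOX. Setting the environment variable NCCL_DEBUG=INFO before launching makes NCCL print more detail about what failed.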

FateScript, Jul 06 '22 03:07

I also ran into this issue. It seems to be caused by torch; adding --ipc=host to my docker run command resolved it.
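
One plausible reason the flag helps (an assumption, not confirmed in this thread): Docker gives a container a 64 MB /dev/shm by default, and --ipc=host makes the container use the host's shared memory instead. The sketch below, using only the Python standard library, reports the size of /dev/shm inside the container. The function name shm_report and the 1 GiB threshold are arbitrary choices for illustration, not values taken from NCCL or PyTorch.

```python
import os
import shutil


def shm_report(path: str = "/dev/shm", min_bytes: int = 1 << 30) -> str:
    """Return a one-line summary of the shared-memory mount at `path`.

    `min_bytes` is an illustrative threshold (1 GiB by default), not an
    official requirement of NCCL or PyTorch.
    """
    if not os.path.isdir(path):
        return f"{path} not found"
    total = shutil.disk_usage(path).total
    verdict = "ok" if total >= min_bytes else "small"
    return f"{path}: {total / 2**20:.0f} MiB ({verdict})"


if __name__ == "__main__":
    print(shm_report())
```

If the report shows roughly 64 MiB, restarting the container with docker run --ipc=host, or with a larger --shm-size, is worth trying before changing the torch version.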

Leon-Cheung-CQ, Jun 07 '23 09:06