YOLOX
Training error:
Hi, I ran the following command:
python ${workspace}/train.py -f ${train_data_dir}/yolox_voc_s.py -d 4 -b 64 -c ${weights_data_dir}/yolox_s.pth
But an error occurred:
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
So, could you please tell me how to train YOLOX on multiple GPUs?
ubuntu: 20.03
yolox: 0.3.0
python: 3.6.9
pytorch: 1.10.1
cuda version: 10.2
driver version: 460.56
- Check whether other code that uses DDP works on your machine.
- If it does not work, try another torch version or check your hardware.
- If it works, try adding
import yolox
to your code and report back to us.
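For the first check, a minimal standalone DDP test might look like the sketch below. It is not part of YOLOX; the names ddp_env, worker, and run_check are made up for illustration, and the port 29500 and the 127.0.0.1 address are assumptions for a single-machine run. It only calls barrier and all_reduce over NCCL, so if it fails with the same ncclSystemError, the problem is likely in the torch/NCCL/system setup rather than in the training code.

```python
import os


def ddp_env(rank: int, world_size: int, port: int = 29500) -> dict:
    """Environment variables read by torch.distributed's default env:// init.

    The address and port are assumptions for a single-machine test.
    """
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def worker(rank: int, world_size: int) -> None:
    # torch is imported lazily so this file can be loaded without it.
    import torch
    import torch.distributed as dist

    os.environ.update(ddp_env(rank, world_size))
    torch.cuda.set_device(rank)  # one process per GPU
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    dist.barrier()  # the same collective that fails in the traceback
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # default op is SUM, so each rank ends with world_size
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


def run_check() -> None:
    import torch
    import torch.multiprocessing as mp

    n_gpus = torch.cuda.device_count()
    # mp.spawn calls worker(i, n_gpus) for i in range(n_gpus)
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)


# To run as a script, add these two lines at the end of the file:
# if __name__ == "__main__":
#     run_check()
```

Running it with NCCL_DEBUG=INFO set in the environment makes NCCL print more detail about which system call failed.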
I also ran into this issue, and it seems to be caused by torch. Adding --ipc=host to my docker run command fixed it for me.
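One likely explanation for why --ipc=host helps: Docker gives a container a 64 MiB /dev/shm by default, and multi-process torch/NCCL jobs use shared memory between the processes on one machine. The sketch below is a quick way to see how much shared memory the container actually has. The function names are made up for illustration, and the 64 MiB threshold is only a heuristic for spotting the Docker default.

```python
import shutil

# Docker's default size for /dev/shm inside a container.
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def looks_like_docker_default(total_bytes: int) -> bool:
    """Heuristic: shared memory at or below 64 MiB is probably the default."""
    return total_bytes <= DOCKER_DEFAULT_SHM_BYTES


def shm_report(path: str = "/dev/shm") -> str:
    """Summarize total and free space of the filesystem mounted at path."""
    usage = shutil.disk_usage(path)
    line = (
        f"{path}: total={usage.total / 2**20:.0f} MiB, "
        f"free={usage.free / 2**20:.0f} MiB"
    )
    if looks_like_docker_default(usage.total):
        line += " (small; consider --ipc=host or a larger --shm-size)"
    return line
```

Calling print(shm_report()) inside the container shows the current size. If it reports about 64 MiB, starting the container with --ipc=host (or with a larger --shm-size) is worth trying before changing torch versions.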