DDP error

Open jay985735639 opened this issue 2 years ago • 2 comments

Search before asking

  • [X] I have searched the YOLOAir issues and found no similar questions.

Question

Traceback (most recent call last):
  File "train.py", line 695, in <module>
    main(opt)
  File "train.py", line 591, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 370, in train
    pred = model(imgs)  # forward
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
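
The error message above says that some parameters did not contribute to the loss. One way to check which ones is to run a single forward/backward pass without the DDP wrapper and list the parameters whose gradient is still `None`. The snippet below is only a minimal sketch of that idea: `ToyNet` and `find_unused_parameters` are hypothetical names made up for illustration and are not part of YOLOAir, and the summed output stands in for the real loss.

```python
import torch
import torch.nn as nn


class ToyNet(nn.Module):
    """Hypothetical stand-in for a model with a layer that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # registered, but skipped in forward()

    def forward(self, x):
        return self.used(x)


def find_unused_parameters(model, inputs):
    """Return names of parameters that get no gradient from one forward/backward pass."""
    model.zero_grad(set_to_none=True)
    loss = model(inputs).sum()  # placeholder loss; use the real training loss in practice
    loss.backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


unused = find_unused_parameters(ToyNet(), torch.randn(3, 4))
print(unused)  # ['unused.weight', 'unused.bias']
```

If the same check on the real model (built from the `--cfg` file and using the actual training loss) returns a non-empty list, the options are to fix the configuration so those layers are used, or to pass `find_unused_parameters=True` wherever `train.py` constructs `torch.nn.parallel.DistributedDataParallel`, as the error message suggests.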

Additional

No response

jay985735639 • Sep 26 '22 02:09

python -m torch.distributed.run --nproc_per_node 2 train.py --data coco.yaml --cfg ./configs/backbone/yolov5_res2net50.yaml --batch-size 16 --epoch 300 --device 0,1 --weight ''

jay985735639 • Sep 26 '22 03:09

Does this issue occur when training other model configurations?

iscyy • Sep 29 '22 06:09