Yolo-FastestV2

Unable to run distributed training with torch.nn.parallel.DistributedDataParallel?

Open hexins opened this issue 3 years ago • 1 comment

When I try distributed training with DistributedDataParallel, it runs into problems: sometimes it hangs at total_loss.backward(), and sometimes preds = model(imgs) raises the following error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since find_unused_parameters=True is enabled, this likely means that not all forward outputs participate in computing loss. You can fix this by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameters which did not receive grad for rank 0: fpn.reg_head_3.block.9.bias, fpn.reg_head_3.block.9.weight, fpn.reg_head_3.block.8.weight, fpn.reg_head_3.block.6.bias, fpn.reg_head_3.block.6.weight, fpn.reg_head_3.block.5.weight, fpn.reg_head_3.block.4.bias, fpn.reg_head_3.block.4.weight, fpn.reg_head_3.block.0.weight, fpn.reg_head_3.block.1.weight, fpn.reg_head_3.block.1.bias, fpn.reg_head_3.block.3.weight

Has anyone else run into this problem? A rough sketch of the kind of DDP setup involved is below.
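
For reference, a minimal sketch of a DDP training loop like the one described above, assuming a torchrun-style launch; the dataset, batch size, and compute_loss are hypothetical placeholders, not the actual Yolo-FastestV2 training script:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    # One process per GPU, launched e.g. with torchrun
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    # find_unused_parameters=True is what produces the check quoted in the
    # error message: every parameter must receive a gradient each iteration,
    # so every detection head (e.g. fpn.reg_head_3) has to feed the loss.
    model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for imgs, targets in loader:
            imgs = imgs.cuda(local_rank, non_blocking=True)
            preds = model(imgs)
            # compute_loss is a placeholder for the project's loss function;
            # if any forward output is left out of it, DDP raises the
            # "Expected to have finished reduction" error shown above.
            total_loss = compute_loss(preds, targets)
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
```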

hexins avatar Oct 17 '21 17:10 hexins

Did you manage to get multi-GPU training working?

fangichao avatar Nov 12 '21 08:11 fangichao