Yolo-FastestV2

Unable to run distributed training with torch.nn.parallel.DistributedDataParallel?

Open hexins opened this issue 3 years ago • 1 comment

When I try distributed training with DistributedDataParallel, it runs into problems: sometimes it hangs at total_loss.backward(), and sometimes preds = model(imgs) raises the following error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since find_unused_parameters=True is enabled, this likely means that not all forward outputs participate in computing loss. You can fix this by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameters which did not receive grad for rank 0: fpn.reg_head_3.block.9.bias, fpn.reg_head_3.block.9.weight, fpn.reg_head_3.block.8.weight, fpn.reg_head_3.block.6.bias, fpn.reg_head_3.block.6.weight, fpn.reg_head_3.block.5.weight, fpn.reg_head_3.block.4.bias, fpn.reg_head_3.block.4.weight, fpn.reg_head_3.block.0.weight, fpn.reg_head_3.block.1.weight, fpn.reg_head_3.block.1.bias, fpn.reg_head_3.block.3.weight

Has anyone else run into this problem? A rough sketch of the kind of DDP setup involved is below.
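
For reference, a minimal sketch of a DDP training loop like the one described above, assuming a torchrun-style launch; the dataset, batch size, and compute_loss are hypothetical placeholders, not the actual Yolo-FastestV2 training script:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    # One process per GPU, launched e.g. with torchrun
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    # find_unused_parameters=True is what produces the check quoted in the
    # error message: every parameter must receive a gradient each iteration,
    # so every detection head (e.g. fpn.reg_head_3) has to feed the loss.
    model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for imgs, targets in loader:
            imgs = imgs.cuda(local_rank, non_blocking=True)
            preds = model(imgs)
            # compute_loss is a placeholder for the project's loss function;
            # if any forward output is left out of it, DDP raises the
            # "Expected to have finished reduction" error shown above.
            total_loss = compute_loss(preds, targets)
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
```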

hexins avatar Oct 17 '21 17:10 hexins

Did you manage to get multi-GPU training working?

fangichao avatar Nov 12 '21 08:11 fangichao