Yolo-FastestV2
Unable to run distributed training with torch.nn.parallel.DistributedDataParallel?
I tried distributed training with DistributedDataParallel and ran into problems: sometimes it hangs at total_loss.backward(), and sometimes it raises an error at preds = model(imgs):
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since find_unused_parameters=True is enabled, this likely means that not all forward outputs participate in computing loss. You can fix this by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: fpn.reg_head_3.block.9.bias, fpn.reg_head_3.block.9.weight, fpn.reg_head_3.block.8.weight, fpn.reg_head_3.block.6.bias, fpn.reg_head_3.block.6.weight, fpn.reg_head_3.block.5.weight, fpn.reg_head_3.block.4.bias, fpn.reg_head_3.block.4.weight, fpn.reg_head_3.block.0.weight, fpn.reg_head_3.block.1.weight, fpn.reg_head_3.block.1.bias, fpn.reg_head_3.block.3.weight
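For context, here is a minimal, self-contained sketch of the failure mode the error describes: a module whose forward returns an output that never feeds the loss, wrapped in DDP with find_unused_parameters=True. This is an illustrative assumption on my part, not the repo's actual model or train.py; the TwoHeadNet class, tensor shapes, and launch command are all hypothetical.

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoHeadNet(nn.Module):
    """Toy stand-in for a detector with separate cls/reg heads."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.cls_head = nn.Linear(8, 4)
        self.reg_head = nn.Linear(8, 4)   # analogous to fpn.reg_head_3

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.reg_head(feat)

def main():
    # Launch with e.g.: torchrun --nproc_per_node=2 ddp_repro.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(TwoHeadNet().cuda(local_rank),
                device_ids=[local_rank],
                find_unused_parameters=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for _ in range(3):
        imgs = torch.randn(4, 8).cuda(local_rank)
        cls_out, reg_out = model(imgs)

        # Failure mode being illustrated: only cls_out feeds the loss, so the
        # reg_head parameters never receive gradients. Depending on the PyTorch
        # version, backward() can hang or the next forward() raises the
        # "Expected to have finished reduction" error quoted above.
        total_loss = cls_out.sum()
        # Fix: make every forward output participate in the loss, e.g.
        # total_loss = cls_out.sum() + reg_out.sum()

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()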
Has anyone else run into this problem?
Did you ever get multi-GPU training working?