[DDP] Parallel training issue / Multi-head shared-backbone model triggers “Expected to mark a variable ready only once” — how to parallelize training?
In multi-GPU DDP training, the model has a shared backbone (LLM) and multiple output heads (8 channels, each computing a different loss). In a single forward pass, all heads use the same backbone parameters (e.g., down_proj.weight) to compute 8 separate losses, which are then weighted, summed into a single total_loss, and backpropagated once.
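For clarity, here is a minimal sketch of this setup (class and field names are illustrative rather than the actual model code, and the backbone is assumed to return HF-style hidden states):

```python
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared LLM backbone feeding 8 independent output heads."""
    def __init__(self, backbone: nn.Module, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone  # shared parameters, e.g. down_proj.weight
        self.heads = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))

    def forward(self, **inputs):
        # one backbone pass; every head reads the same hidden states
        hidden = self.backbone(**inputs).last_hidden_state
        return [head(hidden[:, -1, :]) for head in self.heads]

def total_loss_fn(outputs, targets, weights):
    # 8 per-head losses, weighted and summed into a single scalar
    losses = [nn.functional.mse_loss(o, t) for o, t in zip(outputs, targets)]
    return sum(w * l for w, l in zip(weights, losses))
```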
Single-GPU training works fine, but in multi-GPU DDP we get:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss
or
Parameter at index 314 with name model.language_model.layers.27.mlp.down_proj.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
Tried:
• Disabled gradient_checkpointing
• find_unused_parameters=True/False
• model._set_static_graph() + static_graph=True
• Only one backward() call per iteration
None of these resolved the error. I suspect this is a known DDP limitation when handling multi-output, multi-loss architectures with shared parameters. Could the maintainers share the recommended way to handle this, and how do you parallelize training for similar architectures?
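For reference, the wrapping and training loop look roughly like this (a simplified sketch: MultiHeadModel and total_loss_fn follow the illustration above, and backbone, hidden_size, dataloader, optimizer, and loss_weights are placeholders for the real objects):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # launched with torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MultiHeadModel(backbone, hidden_size).to(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=False,   # also tried True
    static_graph=True,              # also tried model._set_static_graph()
)

for batch in dataloader:
    outputs = ddp_model(**batch["inputs"])                  # single forward pass
    loss = total_loss_fn(outputs, batch["targets"], loss_weights)
    optimizer.zero_grad()
    loss.backward()                                         # single backward pass
    optimizer.step()
```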
Any help would be appreciated!!
Thanks for the feedback. We've fixed the issue. Please pull the latest code and try again.
Thank you very much for your reply; this is very helpful. Additionally, have you considered packing the fine-tuning data to further reduce fine-tuning time? If possible, please provide some guidance on this. I believe it would greatly benefit those using your model for downstream tasks.
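To clarify what I mean by packing, here is a rough sketch (the function name and details are only illustrative; real packing would also need per-sample attention masks or position ids so that packed samples do not attend to each other):

```python
def pack_sequences(tokenized_samples, max_len, pad_id):
    """Greedily concatenate tokenized samples into fixed-length packed rows."""
    packed, current = [], []
    for ids in tokenized_samples:
        ids = ids[:max_len]  # truncate oversize samples
        if current and len(current) + len(ids) > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current = current + ids
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed
```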