[DDP] Parallel training issue / Multi-head shared-backbone model triggers “Expected to mark a variable ready only once” — how to parallelize training?
In multi-GPU DDP training, the model has a shared backbone (LLM) and multiple output heads (8 channels, each computing a different loss). In a single forward pass, all heads use the same backbone parameters (e.g., down_proj.weight) to compute 8 separate losses, which are then weighted, summed into a single total_loss, and backpropagated once.
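For clarity, here is a minimal sketch of this setup (class and field names are illustrative rather than the actual model code, and the backbone is assumed to return HF-style hidden states):

```python
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared LLM backbone feeding 8 independent output heads."""
    def __init__(self, backbone: nn.Module, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone  # shared parameters, e.g. down_proj.weight
        self.heads = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))

    def forward(self, **inputs):
        # one backbone pass; every head reads the same hidden states
        hidden = self.backbone(**inputs).last_hidden_state
        return [head(hidden[:, -1, :]) for head in self.heads]

def total_loss_fn(outputs, targets, weights):
    # 8 per-head losses, weighted and summed into a single scalar
    losses = [nn.functional.mse_loss(o, t) for o, t in zip(outputs, targets)]
    return sum(w * l for w, l in zip(weights, losses))
```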
Single-GPU training works fine, but in multi-GPU DDP we get:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss
or
Parameter at index 314 with name model.language_model.layers.27.mlp.down_proj.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
Tried:
• Disabled gradient_checkpointing
• find_unused_parameters=True/False
• model._set_static_graph() + static_graph=True
• Only one backward() call per iteration
None of these resolved the error. I suspect this is a known DDP limitation when handling multi-output, multi-loss architectures with shared parameters. Could the maintainers share the recommended way to handle this, and how do you parallelize training for similar architectures?
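For reference, the wrapping and training loop look roughly like this (a simplified sketch: MultiHeadModel and total_loss_fn follow the illustration above, and backbone, hidden_size, dataloader, optimizer, and loss_weights are placeholders for the real objects):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # launched with torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MultiHeadModel(backbone, hidden_size).to(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=False,   # also tried True
    static_graph=True,              # also tried model._set_static_graph()
)

for batch in dataloader:
    outputs = ddp_model(**batch["inputs"])                  # single forward pass
    loss = total_loss_fn(outputs, batch["targets"], loss_weights)
    optimizer.zero_grad()
    loss.backward()                                         # single backward pass
    optimizer.step()
```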
Any help would be appreciated!!
Thanks for the feedback. We've fixed the issue. Please pull the latest code and try again.
Thank you very much for your reply; this is very helpful. Additionally, have you considered packing the fine-tuning data to further reduce fine-tuning time? If possible, please provide some guidance on this. I believe it would greatly benefit those using your model for downstream tasks.
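To clarify what I mean by packing, here is a rough sketch (the function name and details are only illustrative; real packing would also need per-sample attention masks or position ids so that packed samples do not attend to each other):

```python
def pack_sequences(tokenized_samples, max_len, pad_id):
    """Greedily concatenate tokenized samples into fixed-length packed rows."""
    packed, current = [], []
    for ids in tokenized_samples:
        ids = ids[:max_len]  # truncate oversize samples
        if current and len(current) + len(ids) > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current = current + ids
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed
```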