symhsym

2 issues by symhsym

Thank you for your great work on MOSS. It has been very inspiring! I assume that, for efficiency reasons, the model was not trained on individual samples one at a time. Given MOSS's unique...

In multi-GPU DDP training, the model has a shared backbone (LLM) and multiple output heads (8 channels, each computing a different loss). In a single forward pass, all heads use...
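To make the setup concrete, here is a minimal toy sketch of a shared backbone feeding several heads, each with its own loss. It is pure Python with scalar weights, so it does not model DDP or the actual MOSS architecture. Every name and number in it (`forward`, `per_head_losses`, the squared-error loss, the weights) is an illustrative assumption. It only checks one general fact: the gradient that the summed loss sends to the shared backbone equals the sum of the per-head gradients.

```python
# Toy scalar model: one shared backbone weight feeding 8 linear heads.
# All names and numbers are illustrative assumptions, not taken from MOSS.

NUM_HEADS = 8


def forward(w, heads, x):
    """One forward pass: the hidden state is computed once and reused by every head."""
    hidden = w * x
    return hidden, [v * hidden for v in heads]


def per_head_losses(outputs, targets):
    """Squared error per head, standing in for each channel's own loss."""
    return [(y - t) ** 2 for y, t in zip(outputs, targets)]


def backbone_grad_per_head(w, heads, x, targets):
    """Analytic d(loss_k)/dw for each head k, via the chain rule."""
    _, outputs = forward(w, heads, x)
    return [2.0 * (y - t) * v * x for y, v, t in zip(outputs, heads, targets)]


def backbone_grad_of_sum(w, heads, x, targets, eps=1e-6):
    """Central finite difference of the summed loss with respect to w."""
    def total(w_value):
        _, outputs = forward(w_value, heads, x)
        return sum(per_head_losses(outputs, targets))
    return (total(w + eps) - total(w - eps)) / (2 * eps)


w = 0.5                                             # shared backbone weight
heads = [0.1 * (k + 1) for k in range(NUM_HEADS)]   # one weight per head
targets = [float(k) for k in range(NUM_HEADS)]      # one target per head
x = 2.0                                             # single input sample

grads = backbone_grad_per_head(w, heads, x, targets)
summed = backbone_grad_of_sum(w, heads, x, targets)
```

In this toy model, one backward pass over the summed loss gives the backbone the same gradient as accumulating the eight per-head gradients separately.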