symhsym
Results: 2 issues of symhsym
Thank you for your great work on MOSS—it’s been very inspiring! I believe the model couldn't have been trained on individual samples sequentially due to efficiency concerns. Given MOSS's unique...
In multi-GPU DDP training, the model has a shared backbone (LLM) and multiple output heads (8 channels, each computing a different loss). In a single forward pass, all heads use...