MOSS-TTSD
Question about large-scale data training and packing algorithm for MOSS-style inputs
Thank you for your great work on MOSS—it’s been very inspiring!
I assume the model wasn't trained on individual samples one at a time, since that would be inefficient. Given MOSS-TTSD's unique input format (delay shifting), standard sequence-packing algorithms might not apply directly.
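To make the question concrete, here is a minimal sketch of the kind of delay shifting I have in mind, assuming it resembles the MusicGen-style delay pattern (the `PAD` placeholder and `apply_delay` function are my own illustration, not MOSS-TTSD's actual code):

```python
# Hypothetical delay-pattern sketch: codebook k is shifted right by k
# steps, so frame t of codebook k is emitted at step t + k. PAD marks
# positions with no real token; the real model presumably uses a
# dedicated pad token id.
PAD = -1

def apply_delay(codes):
    """codes: list of K rows (one per codebook), each of length T.
    Returns K rows of length T + K - 1 with the delay applied."""
    K, T = len(codes), len(codes[0])
    total = T + K - 1
    return [
        [PAD] * k + row + [PAD] * (total - T - k)
        for k, row in enumerate(codes)
    ]

codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # K=3 codebooks, T=3 frames
for row in apply_delay(codes):
    print(row)
```

The ragged PAD regions at both ends of each shifted sequence are exactly what makes naive concatenation-based packing awkward, hence my question.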
Could you please share:
- What strategy was used for packing/grouping the training data?
- Were there any special considerations for handling the special tokens during batching?
- If custom padding/packing approaches were developed, would you be willing to share the method?
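For reference, the baseline I would otherwise fall back on is standard greedy packing with a block-diagonal causal mask so packed samples can't attend across boundaries. This sketch is my own assumption of a generic approach (all names, including `pack` and `max_len`, are hypothetical), not a guess at MOSS-TTSD's implementation:

```python
import numpy as np

# Hypothetical greedy-packing sketch: concatenate (already delay-shifted,
# flattened) samples into bins of at most max_len tokens, and build a
# block-diagonal causal attention mask per bin so tokens only attend to
# earlier tokens within the same sample.
def pack(samples, max_len):
    """samples: list of 1-D int arrays. Returns (packed_bins, masks)."""
    packed, lengths = [], []
    cur, cur_lens = [], []
    for s in samples:
        if cur and sum(cur_lens) + len(s) > max_len:
            packed.append(np.concatenate(cur))
            lengths.append(cur_lens)
            cur, cur_lens = [], []
        cur.append(s)
        cur_lens.append(len(s))
    if cur:
        packed.append(np.concatenate(cur))
        lengths.append(cur_lens)

    masks = []
    for lens in lengths:
        n = sum(lens)
        mask = np.zeros((n, n), dtype=bool)
        start = 0
        for L in lens:
            # causal (lower-triangular) attention within each sample only
            mask[start:start + L, start:start + L] = np.tril(
                np.ones((L, L), dtype=bool)
            )
            start += L
        masks.append(mask)
    return packed, masks

samples = [np.arange(4), np.arange(3), np.arange(5)]
packed, masks = pack(samples, max_len=8)
print([len(p) for p in packed])  # bin sizes after greedy packing
```

My uncertainty is whether this kind of scheme still works once the delay shift spreads one sample's codebooks over overlapping, staggered positions, which is why I'm asking what you did instead.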
This information would be extremely helpful for my own research with similar architectures. Thanks in advance for your insights!