docs or demos to illustrate the key features
Congrats on the release! Some of the features are so cool they feel like black magic.
Would it be possible to explain the key techniques behind those features, or provide a tutorial/demo so users can reproduce the claimed results?
In particular, I am interested in this claim:
Memory-efficient design: Train 200B MoE models on 64k sequence lengths without sequence parallelism through advanced memory optimization techniques
This sounds very challenging without aggressive offloading and recomputation, which would likely slow down each iteration.
@qsh-zh Thank you for your attention!
Indeed, offloading and re-computation techniques will be used, but they will not affect the speed. The speed results in the README, measured on 256 × H800 GPUs, are actual results on 64k packed data.
We are adding the relevant documentation, along with instructions for reproducing the results.
@pppppM Thanks for your response! Looking forward to the reproduction instructions.
offloading and re-computation techniques will be used, but they will not affect the speed.
I don’t see how recomputation would not affect the speed.
Could you please also share a link to the code that overlaps offloading with computation? It sounds very interesting that this can be done without affecting the speed.
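
To make the concern concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model of my own, not this project's code: the function `step_time`, its parameters, and the numbers are all hypothetical, and it assumes a backward pass costs about twice a forward pass and that overlapped transfers are hidden perfectly behind compute. Under those assumptions, fully overlapped offloading can be free, but recomputation always adds forward-pass time.

```python
def step_time(fwd, bwd, n_layers, recompute_layers=0,
              offload_per_layer=0.0, overlap=True):
    """Toy per-iteration time model in arbitrary units (hypothetical)."""
    # Each recomputed layer reruns its forward pass once during backward.
    compute = n_layers * (fwd + bwd) + recompute_layers * fwd
    # Total host<->device copy time for offloaded activations.
    transfer = n_layers * offload_per_layer
    if overlap:
        # Idealized overlap: copies run alongside compute, so only the
        # longer of the two determines the step time.
        return max(compute, transfer)
    # No overlap: copies are fully exposed on the critical path.
    return compute + transfer


baseline = step_time(fwd=1.0, bwd=2.0, n_layers=10)
full_recompute = step_time(fwd=1.0, bwd=2.0, n_layers=10, recompute_layers=10)
hidden_offload = step_time(fwd=1.0, bwd=2.0, n_layers=10, offload_per_layer=2.0)
exposed_offload = step_time(fwd=1.0, bwd=2.0, n_layers=10,
                            offload_per_layer=2.0, overlap=False)

print(baseline, full_recompute, hidden_offload, exposed_offload)
```

In this model, offloading costs nothing as long as the copy time stays below the compute time, while full recomputation costs roughly a third more per step. That is why I would like to understand how the recomputation cost is avoided or hidden in practice.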