Zijie Yan

Results: 12 comments by Zijie Yan

> @szhengac You are correct, LAMB and LARS implementations that are not aware of ZeRO will not work correctly with ZeRO. This is not a fundamental limitation of optimizer partitioning...

Thank you for letting us know! We have a fix, but it is not yet merged. A temporary workaround (WAR) is to replace `tensor_model_parallel_size * context_parallel_size` with just `tensor_model_parallel_size`.
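For illustration only, a minimal sketch of the WAR; the variable names and location here are placeholders, not the actual Megatron-LM code:

```python
# Illustrative placeholder: the real change lives inside Megatron-LM, wherever
# the product tensor_model_parallel_size * context_parallel_size is used.
tensor_model_parallel_size = 2
context_parallel_size = 4

# Before the fix (problematic):
# scale = tensor_model_parallel_size * context_parallel_size

# Temporary workaround (WAR) until the fix is merged:
scale = tensor_model_parallel_size
```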

This issue should have been resolved by https://github.com/NVIDIA/Megatron-LM/commit/b5aba3a2f3165da8b4f6b483bf3a6da2a24718e4.

Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions: 1. Update the code to the...
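For context on why 130 TFLOPS reads as low, here is a rough sketch of the commonly used per-GPU model-TFLOPS estimate for a GPT-style model (an approximation assumed here, not necessarily the exact formula Megatron-LM logs):

```python
def approx_tflops_per_gpu(num_params, tokens_per_sec, num_gpus,
                          seq_len, hidden_size, num_layers):
    """Rough estimate: ~6*N FLOPs per token for the dense parts (fwd+bwd),
    plus ~12*L*h*s FLOPs per token for the attention-score terms."""
    flops_per_token = 6 * num_params + 12 * num_layers * hidden_size * seq_len
    return flops_per_token * tokens_per_sec / num_gpus / 1e12

# H100 peak dense BF16 is roughly 989 TFLOPS, so ~130 achieved TFLOPS is well
# under 20% utilization, which is why it looks too low.
```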

> Hi, thanks for the suggestions. I retested the throughput according to your suggestions. To be more specific: > > 1. Update Megatron-LM to the latest commit ([ba77325](https://github.com/NVIDIA/Megatron-LM/commit/ba773259dbe5735fbd91ca41e7f4ded60b335c52)) > 2. Update...

Hi @ShinoharaHare, our environment is: 1. DGX H100, 64 GPUs. 2. [PyTorch 24.03 image](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-03.html). I double-checked your scripts and suggest the following modifications: 1. Seq Len: 2048 -> 4096...

> Does grouped_gemm support variable token lengths to local experts on the same rank?

Yes, we support variable lengths for inputs from each local expert.
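As a minimal PyTorch sketch of what that means (the reference loop and shapes below are only illustrative; grouped_gemm fuses this into a single kernel):

```python
import torch

num_local_experts, hidden, ffn = 4, 1024, 4096
tokens_per_expert = torch.tensor([7, 0, 123, 42])        # variable, may include zero
x = torch.randn(int(tokens_per_expert.sum()), hidden)    # tokens already sorted by expert
w = torch.randn(num_local_experts, hidden, ffn)          # one weight matrix per local expert

# Reference loop: grouped_gemm replaces this with one fused kernel that takes the
# per-expert token counts and handles the uneven splits internally.
chunks = torch.split(x, tokens_per_expert.tolist())
y = torch.cat([chunk @ w[i] for i, chunk in enumerate(chunks)])   # (total_tokens, ffn)
```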

> Which modification brings the most speed improvement? BTW, I encountered some errors when converting Mixtral from transformers to Megatron when grouped-gemm is set; can you share some conversion scripts?...