Zijie Yan

Results: 12 comments by Zijie Yan

> @szhengac You are correct, LAMB and LARS implementations that are not aware of ZeRO will not work correctly with ZeRO. This is not a fundamental limitation of optimizer partitioning...

Thank you for letting us know! We have a fix, but it is not yet merged. A temporary workaround (WAR) is to replace `tensor_model_parallel_size * context_parallel_size` with just `tensor_model_parallel_size`.
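For illustration only, a minimal sketch of the WAR; the variable names and location here are placeholders, not the actual Megatron-LM code:

```python
# Illustrative placeholder: the real change lives inside Megatron-LM, wherever
# the product tensor_model_parallel_size * context_parallel_size is used.
tensor_model_parallel_size = 2
context_parallel_size = 4

# Before the fix (problematic):
# scale = tensor_model_parallel_size * context_parallel_size

# Temporary workaround (WAR) until the fix is merged:
scale = tensor_model_parallel_size
```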

This issue should have been resolved by https://github.com/NVIDIA/Megatron-LM/commit/b5aba3a2f3165da8b4f6b483bf3a6da2a24718e4.

Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions: 1. Update the code to the...
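For context on why 130 TFLOPS reads as low, here is a rough sketch of the commonly used per-GPU model-TFLOPS estimate for a GPT-style model (an approximation assumed here, not necessarily the exact formula Megatron-LM logs):

```python
def approx_tflops_per_gpu(num_params, tokens_per_sec, num_gpus,
                          seq_len, hidden_size, num_layers):
    """Rough estimate: ~6*N FLOPs per token for the dense parts (fwd+bwd),
    plus ~12*L*h*s FLOPs per token for the attention-score terms."""
    flops_per_token = 6 * num_params + 12 * num_layers * hidden_size * seq_len
    return flops_per_token * tokens_per_sec / num_gpus / 1e12

# H100 peak dense BF16 is roughly 989 TFLOPS, so ~130 achieved TFLOPS is well
# under 20% utilization, which is why it looks too low.
```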

> Hi, thanks for the suggestions. I retested the throughput according to your suggestions. To be more specific: > > 1. Update Megatron-LM to the latest commit ([ba77325](https://github.com/NVIDIA/Megatron-LM/commit/ba773259dbe5735fbd91ca41e7f4ded60b335c52)) > 2. Update...

Hi @ShinoharaHare, our environment is: 1. DGX H100, 64 GPUs. 2. [PyTorch 24.03 image](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-03.html). I double-checked your scripts and suggest the following modifications: 1. Seq Len: 2048 -> 4096...

> Does grouped_gemm support variable token lengths to local experts on the same rank?

Yes, we support variable lengths for inputs from each local expert.
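As a minimal PyTorch sketch of what that means (the reference loop and shapes below are only illustrative; grouped_gemm fuses this into a single kernel):

```python
import torch

num_local_experts, hidden, ffn = 4, 1024, 4096
tokens_per_expert = torch.tensor([7, 0, 123, 42])        # variable, may include zero
x = torch.randn(int(tokens_per_expert.sum()), hidden)    # tokens already sorted by expert
w = torch.randn(num_local_experts, hidden, ffn)          # one weight matrix per local expert

# Reference loop: grouped_gemm replaces this with one fused kernel that takes the
# per-expert token counts and handles the uneven splits internally.
chunks = torch.split(x, tokens_per_expert.tolist())
y = torch.cat([chunk @ w[i] for i, chunk in enumerate(chunks)])   # (total_tokens, ffn)
```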

> Which modification brings the most speed improvement? BTW, I encountered some errors when converting Mixtral from transformers to Megatron when grouped-gemm is set; can you share some conversion scripts?...