Zhipeng Zhang issues

Results: 7 issues of Zhipeng Zhang

Analysis of loss spikes in LLaMA pretrain

Dear LLaMA Teams, A huge thank you for making your remarkable work available to the public! I've taken a close look at the pretraining loss curves depicted in Figure 1...

Fix wrong key for output_layer_init_method

Given that we aim to use mcore to do the training, we have a function to parse the args from Megatron-LM to mcore. However, the key of `output_layer_init_method` is...

stale

Analysis of loss spikes in LLaMA pretrain

Dear LLaMA Teams, A huge thank you for making your remarkable work available to the public! I've taken a close look at the pretraining loss curves depicted in Figure 1...

MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism

Hi, I've noticed that you have implemented a feature that allows computation and communication to overlap in tensor parallel operations. This is a significant enhancement that has the potential to...

enhancement

Quantitative Analysis of FP8 GEMM's Impact on LLM Convergence

Hi, I've been exploring the impressive work you've done on incorporating FP8 GEMM to accelerate tensor matrix multiplication operations in TransformerEngine. The initiative is well-supported by the findings in the...

FA3 unit test fails

After running

```sh
cd hopper
python setup.py install
export PYTHONPATH=$PWD
pytest -q -s test_flash_attn.py
```

I got the following assertion error:

```
FAILED test_flash_attn.py::test_flash_attn_output[257-1-128-False-False-mha-dtype0] - AssertionError: assert 0.0078125
```

[BUG] Hopper groupgemm example fails for mnk(1638, 6144, 3584)

**Describe the bug** The example code here [1] fails to run with mnk=(1638, 6144, 3584) and reports `Got cutlass error: Invalid status at: 670`. **Steps/Code to reproduce bug** ``` cd cutlass/examples/57_hopper_grouped_gemm nvcc...

bug

? - Needs Triage