ZhiyuLi-Nvidia comments

16 comments by ZhiyuLi-Nvidia

Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

> Update:
> The checkpoint save will also fail on GB200 even with a small number of nodes, e.g., 4.
> We may need another person to look into the checkpointing failure...

Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

> @ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged?

@guyueh1 For GB200, the above change is just a **temporary fix** to work around and it...

Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

> > an NCCL collective timeout when scaling up to 64 nodes
> >
> > I'd expect we need some help from experts in mcore checkpoint saving. Now synced up with...

compute_rnnt_timestamps fails with empty char offsets

@nithinraok could you help take a look?

[BUG]Error in get_transformer_layer_offset when virtual_pipeline=True, num_layers_in_last_pipeline_stage is set, and num_layers_in_first_pipeline_stage is not

Thank you for your contribution! We fixed it by following your PR: https://github.com/NVIDIA/Megatron-LM/commit/a77a883e248e68df1912df4ef2cf05b712947fce Let us know what you think.

[BUG]Error in get_transformer_layer_offset when virtual_pipeline=True, num_layers_in_last_pipeline_stage is set, and num_layers_in_first_pipeline_stage is not

Resolved.