[BUG] Error in get_transformer_layer_offset when virtual_pipeline=True, num_layers_in_last_pipeline_stage is set, and num_layers_in_first_pipeline_stage is not

Open yqy3214 opened this issue 7 months ago • 2 comments

Describe the bug

When virtual_pipeline is enabled and num_layers_in_last_pipeline_stage is set without num_layers_in_first_pipeline_stage, get_transformer_layer_offset calculates an incorrect offset for every rank other than rank 0.

Proposed fix

fix_get_transformer_layer_offset #1583

yqy3214 avatar May 17 '25 10:05 yqy3214

Here, when VPP is enabled, the (pipeline_rank - 1) calculation doesn't handle the case where num_layers_in_first_pipeline_stage is None, unlike the logic for the VPP-disabled case below it.

Image
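
To make the off-by-one concrete, here is a minimal standalone sketch of the offset arithmetic described above. It is not the Megatron-LM source: the function `layer_offset`, its parameters, and the `reproduce_bug` switch are hypothetical names for illustration, and it assumes layer counts divide evenly across virtual chunks and middle stages.

```python
from typing import Optional


def layer_offset(
    num_layers: int,
    pp_size: int,
    pp_rank: int,
    vp_size: int,
    vp_rank: int,
    num_first: Optional[int] = None,
    num_last: Optional[int] = None,
    reproduce_bug: bool = False,  # hypothetical switch, for comparison only
) -> int:
    """Index of the first transformer layer held by (pp_rank, vp_rank)."""
    # Stages left over after the explicitly sized first/last stages.
    middle_stages = pp_size - int(num_first is not None) - int(num_last is not None)
    first = num_first or 0
    last = num_last or 0
    middle_layers = num_layers - first - last

    # Each stage's layers are split evenly across its vp_size model chunks.
    first_per_chunk = first // vp_size
    last_per_chunk = last // vp_size
    middle_per_chunk = middle_layers // vp_size
    layers_per_vp_round = first_per_chunk + middle_per_chunk + last_per_chunk
    middle_per_stage = middle_per_chunk // middle_stages if middle_stages > 0 else 0

    if pp_rank == 0:
        return vp_rank * layers_per_vp_round

    if reproduce_bug or num_first is not None:
        # Only valid when rank 0 is a specially sized first stage.
        middle_rank = pp_rank - 1
    else:
        # No special first stage: rank 0 is itself a middle stage.
        middle_rank = pp_rank
    return vp_rank * layers_per_vp_round + first_per_chunk + middle_rank * middle_per_stage


# 14 layers, 4 pipeline stages, 2 virtual chunks, last stage fixed at 2 layers.
cfg = dict(num_layers=14, pp_size=4, vp_size=2, num_last=2)
fixed = [layer_offset(pp_rank=r, vp_rank=0, **cfg) for r in range(4)]
buggy = [layer_offset(pp_rank=r, vp_rank=0, reproduce_bug=True, **cfg) for r in range(4)]
print(fixed)  # [0, 2, 4, 6]
print(buggy)  # [0, 0, 2, 4]
```

In this toy configuration the unconditional (pipeline_rank - 1) shifts every rank after rank 0 back by one middle stage, so ranks 0 and 1 both claim offset 0.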

yqy3214 avatar May 17 '25 15:05 yqy3214

@yanring Hey, I've opened a simple PR to fix this issue. Could someone review it?

yqy3214 avatar May 22 '25 05:05 yqy3214

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 01 '25 02:08 github-actions[bot]

Hey @yqy3214, sorry for the late update. We've recently added a new feature to handle uneven splits more effectively, with a customizable layout. Could you please check this link? It also properly handles the layer offset. Thanks!

https://github.com/NVIDIA/Megatron-LM/blob/84cf979c766f72dfdc7af73d6b4add5ae952c2da/megatron/training/arguments.py#L2269

yanring avatar Aug 01 '25 03:08 yanring

Thank you for your contribution!

We fixed it by following your PR. https://github.com/NVIDIA/Megatron-LM/commit/a77a883e248e68df1912df4ef2cf05b712947fce

Let us know what you think.

ZhiyuLi-Nvidia avatar Aug 01 '25 06:08 ZhiyuLi-Nvidia

Resolved.

ZhiyuLi-Nvidia avatar Aug 05 '25 18:08 ZhiyuLi-Nvidia