[BUG] Error in get_transformer_layer_offset when virtual_pipeline=True, num_layers_in_last_pipeline_stage is set, and num_layers_in_first_pipeline_stage is not
Describe the bug
When virtual_pipeline is enabled and num_layers_in_last_pipeline_stage is set but num_layers_in_first_pipeline_stage is not, get_transformer_layer_offset computes the layer offset incorrectly for every pipeline rank other than rank 0.
Proposed fix: fix_get_transformer_layer_offset #1583
Here, the (pipeline_rank - 1) calculation used when VPP is enabled does not account for the case where num_layers_in_first_pipeline_stage is None, unlike the corresponding non-VPP logic below it.
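To illustrate the intended behaviour, here is a minimal, self-contained sketch of the offset logic. The names, signature, and structure are illustrative only and do not mirror Megatron-LM's actual get_transformer_layer_offset; the virtual-pipeline dimension is omitted to focus on the (pipeline_rank - 1) issue.

```python
def layer_offset(pipeline_rank: int,
                 num_layers: int,
                 pipeline_size: int,
                 num_layers_in_first_stage=None,
                 num_layers_in_last_stage=None) -> int:
    """Global index of the first layer owned by `pipeline_rank` (sketch only)."""
    first = num_layers_in_first_stage
    last = num_layers_in_last_stage

    # Layers and stages left for the evenly split "middle" of the pipeline.
    middle_layers = num_layers - (first or 0) - (last or 0)
    middle_stages = pipeline_size - (first is not None) - (last is not None)
    per_middle_stage = middle_layers // middle_stages

    if first is not None:
        # Rank 0 holds the custom-sized first stage, so later ranks are
        # shifted by `first` and indexed from (pipeline_rank - 1).
        if pipeline_rank == 0:
            return 0
        return first + (pipeline_rank - 1) * per_middle_stage

    # No custom first stage: the (pipeline_rank - 1) shift must NOT be
    # applied; each rank before the last simply owns per_middle_stage layers.
    return pipeline_rank * per_middle_stage
```

For example, with 32 layers, 4 pipeline stages, and only num_layers_in_last_stage=8 set, the offsets should be 0, 8, 16, 24; applying the (pipeline_rank - 1) shift unconditionally misplaces every rank other than rank 0, which is the behaviour described in this issue.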
Hey @yanring, I've opened a simple PR to fix this issue. Could someone review it?
Hey @yqy3214, sorry for the late update. We recently added a new feature that handles uneven splits more effectively via a customizable pipeline layout; please check the link below. It also handles the layer offset correctly. Thanks!
https://github.com/NVIDIA/Megatron-LM/blob/84cf979c766f72dfdc7af73d6b4add5ae952c2da/megatron/training/arguments.py#L2269
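For reference, a minimal, hypothetical sketch of why an explicit per-stage layout makes the offset computation straightforward (the list format and helper name below are illustrative, not Megatron-LM's actual API):

```python
# Illustrative only: with an explicit per-stage layer layout there is no need
# to special-case uneven first/last stages, since offsets are just prefix sums.
def offsets_from_layout(layers_per_stage):
    """Global index of the first layer in each pipeline stage."""
    offsets, total = [], 0
    for n in layers_per_stage:
        offsets.append(total)
        total += n
    return offsets

# Example: 32 layers over 4 stages with a smaller last stage.
print(offsets_from_layout([10, 10, 8, 4]))  # [0, 10, 20, 28]
```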
Thank you for the contribution!
We fixed it by following your PR. https://github.com/NVIDIA/Megatron-LM/commit/a77a883e248e68df1912df4ef2cf05b712947fce
Let us know what you think.
Resolved.