
[QUESTION] Why should pipeline-model-parallel size be greater than 2 with the interleaved schedule?

Open nullnonenilNULL opened this issue 10 months ago • 4 comments

Your question:

[image]

nullnonenilNULL avatar Mar 25 '24 09:03 nullnonenilNULL

You can't use the interleaved schedule without pipeline parallelism. [image]

ethanhe42 avatar Mar 28 '24 17:03 ethanhe42

@ethanhe42 I wonder whether pipeline_model_parallel_size == 2 can be accepted?

yuantailing avatar Mar 31 '24 13:03 yuantailing

> @ethanhe42 I wonder whether pipeline_model_parallel_size == 2 can be accepted?

@ethanhe42 Same question.

nullnonenilNULL avatar Apr 02 '24 05:04 nullnonenilNULL

I think pipeline_model_parallel_size == 2 can be accepted in practice, but perhaps with little or no benefit in reducing the pipeline bubble?
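For a rough sense of scale, here is a minimal sketch based on the commonly cited idealized bubble estimate for pipeline schedules, (p - 1) / (v * m), where p is the pipeline-parallel size, m the number of microbatches, and v the number of model chunks per stage (v = 1 means no interleaving). The function and parameter names are hypothetical, and the estimate ignores the extra communication that interleaving adds, so treat it as an approximation rather than a measurement.

```python
def bubble_fraction(pp_size: int, num_microbatches: int, num_model_chunks: int = 1) -> float:
    """Idealized pipeline bubble as a fraction of ideal compute time.

    Assumes equal-cost stages and zero communication cost (a simplification).
    num_model_chunks=1 corresponds to the non-interleaved schedule.
    """
    if pp_size < 1 or num_microbatches < 1 or num_model_chunks < 1:
        raise ValueError("all arguments must be positive integers")
    return (pp_size - 1) / (num_model_chunks * num_microbatches)


# With PP=2 and 8 microbatches, compare no interleaving against 2 chunks per stage.
plain = bubble_fraction(pp_size=2, num_microbatches=8)
interleaved = bubble_fraction(pp_size=2, num_microbatches=8, num_model_chunks=2)
print(plain, interleaved)
```

Under these assumptions the estimate still shrinks by a factor of v at PP=2, although the absolute bubble is already small when p - 1 = 1, and the added communication may offset part of the gain.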

robotsp avatar Apr 07 '24 05:04 robotsp

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jun 06 '24 18:06 github-actions[bot]

It is because tensor_send_next and tensor_send_prev here are indistinguishable with PP=2: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/pipeline_parallel/p2p_communication.py#L586.
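A minimal sketch of why the two collide at PP=2, assuming pipeline ranks wrap around in a ring (the last stage's "next" is the first stage, which the interleaved schedule relies on to hand activations to the next model chunk). The helper names below are hypothetical and are not taken from p2p_communication.py.

```python
from typing import Tuple


def pipeline_neighbors(rank: int, pp_size: int) -> Tuple[int, int]:
    """Return (prev_rank, next_rank) for a stage, assuming a ring of pp_size stages."""
    prev_rank = (rank - 1) % pp_size
    next_rank = (rank + 1) % pp_size
    return prev_rank, next_rank


def neighbors_coincide(pp_size: int) -> bool:
    """True if every stage has the same peer as both its previous and next rank."""
    return all(
        prev_rank == next_rank
        for prev_rank, next_rank in (pipeline_neighbors(r, pp_size) for r in range(pp_size))
    )


for pp_size in (2, 3, 4):
    print(pp_size, [pipeline_neighbors(r, pp_size) for r in range(pp_size)], neighbors_coincide(pp_size))
```

With two stages, each rank sends both its "next" and its "prev" tensor to the same peer, so the receiver cannot tell the two messages apart unless they are issued separately. From three stages onward, the two peers differ.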

This is a non-issue with overlap_p2p_comm since we split forward and backward communication in steady state. We fixed this here: https://github.com/NVIDIA/Megatron-LM/commit/152c562067cc0de6cbc8fba2a5095208f30d10cd.

Going to mark this as closed; feel free to reopen if you have additional questions.

deepakn94 avatar Jun 06 '24 19:06 deepakn94