PP hangs when pipeline_parallel_microbatches < pipeline_parallel_degree

Open • cassanof opened this issue 9 months ago • 13 comments

Pipeline parallelism seems to hang when the number of microbatches is less than the pipeline parallel degree. This occurs with both the standard and interleaved 1F1B schedules. I have not tested other schedules.
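Until the root cause is found, one way to avoid the silent hang is to fail fast when the configuration falls into the reported range. The following is only a minimal sketch, assuming the condition described in this issue is the trigger. `check_pp_microbatches` is a hypothetical helper, not part of torchtitan, and its arguments simply mirror the two config values named in the title.

```python
def check_pp_microbatches(
    pipeline_parallel_microbatches: int,
    pipeline_parallel_degree: int,
) -> None:
    """Hypothetical guard (not a torchtitan API).

    Raises a ValueError for the configuration reported in this issue,
    where the microbatch count is smaller than the pipeline degree,
    so the job stops with a message instead of hanging.
    """
    if pipeline_parallel_microbatches < pipeline_parallel_degree:
        raise ValueError(
            "pipeline_parallel_microbatches "
            f"({pipeline_parallel_microbatches}) is less than "
            f"pipeline_parallel_degree ({pipeline_parallel_degree}); "
            "this configuration has been reported to hang"
        )
```

Calling `check_pp_microbatches(2, 4)` before building the schedule would raise immediately, while `check_pp_microbatches(4, 4)` passes through.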

cassanof, Jan 06 '25 00:01