zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
https://arxiv.org/abs/2405.15362
I'm curious how you measured the precise bubble time during a run in your experiments (T_Comm in the paper). Megatron-LM's scheduling combines communication and idle time within the same NCCL...
I tested Llama 2 13B on A800 with pipeline parallelism 4, micro-batch-size = 1, and global-batch-size = 64. The 1F1B log: I just use 1F1B and do not use VP iteration...
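For context on what a 1F1B run with these settings should look like, here is a minimal sketch of the textbook ideal bubble fraction, (p - 1) / (m + p - 1), for p pipeline stages and m microbatches per iteration. It assumes every stage spends the same time per microbatch and ignores communication. The data-parallel size is not stated in the post, so `dp_size` is an assumed input, and both function names are hypothetical helpers rather than Megatron-LM APIs.

```python
from fractions import Fraction


def microbatches_per_iteration(global_batch_size, micro_batch_size, dp_size):
    """Number of microbatches each pipeline processes per iteration.

    dp_size (data-parallel size) is an assumption: the post does not state it.
    """
    per_step = micro_batch_size * dp_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch must be divisible by micro_batch_size * dp_size")
    return global_batch_size // per_step


def ideal_1f1b_bubble_fraction(pp_size, num_microbatches):
    """Idle share of an iteration for 1F1B under equal per-stage compute time.

    Each stage idles for (p - 1) microbatch slots out of (m + p - 1) in total.
    """
    return Fraction(pp_size - 1, num_microbatches + pp_size - 1)


# Values from the post, with an assumed data-parallel size of 1.
m = microbatches_per_iteration(global_batch_size=64, micro_batch_size=1, dp_size=1)
bubble = ideal_1f1b_bubble_fraction(pp_size=4, num_microbatches=m)
print(m, bubble, float(bubble))
```

Under the assumed data-parallel size of 1, this gives 64 microbatches and an ideal bubble of 3/67, roughly 4.5% of the iteration. A larger data-parallel size means fewer microbatches per pipeline and a larger bubble.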
I see that zero-bubble-pipeline-parallelism disables FusedLayerNorm. Is that because the fused op cannot split its backward pass into the W (weight-gradient) and X (input-gradient) parts?
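To make the W/X split in the question concrete, the sketch below separates the backward pass of a plain linear layer y = xW into two independent functions: one for the input gradient and one for the weight gradient. It is an illustration in pure Python, not code from the repository, and it does not by itself confirm why FusedLayerNorm is disabled. The helper names `backward_input` and `backward_weight` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def backward_input(grad_out, weight):
    """X part: dL/dx = dL/dy @ W^T. The previous pipeline stage waits on this."""
    return matmul(grad_out, transpose(weight))


def backward_weight(x, grad_out):
    """W part: dL/dW = x^T @ dL/dy. No other stage depends on it, so it can run later."""
    return matmul(transpose(x), grad_out)


x = [[1.0, 2.0]]                   # input, shape 1x2
weight = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 3.0]]         # weight, shape 2x3
grad_out = [[1.0, 1.0, 1.0]]       # upstream gradient dL/dy, shape 1x3

grad_x = backward_input(grad_out, weight)   # needed immediately
grad_w = backward_weight(x, grad_out)       # can be deferred
print(grad_x, grad_w)
```

Because the two gradients are computed by separate calls, a scheduler is free to run the weight-gradient part at a later point. An op that produces both gradients in a single call offers no such seam unless it is rewritten.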