zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
**Your question** It seems that B's timing includes W, while W merely accounts for the time of gradient accumulation. In the megatron/core/pipeline_parallel/zb_schedules.py file, the function `schedule_b` counts the duration of...
I ran multiple sets of experiments and found that ZB is better than 1F1B. Interleaved 1F1B seems to be slightly faster than ZB_V, slightly slower than ZB_2P but saves a...
Currently, the limitation is that `(number_of_layers / number_of_stage)` needs to be an even number.
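The constraint above can be checked before launching a run. The following is a minimal sketch, not code from the repository: the helper name `layers_per_stage` is hypothetical, and it additionally assumes that the layers must divide evenly across the stages, which the note above does not state explicitly.

```python
def layers_per_stage(number_of_layers: int, number_of_stage: int) -> int:
    """Return the layers assigned to each stage, or raise if the quotient is not even.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    if number_of_stage <= 0:
        raise ValueError("number_of_stage must be positive")
    # Assumption: layers are split evenly, so the division must be exact.
    if number_of_layers % number_of_stage != 0:
        raise ValueError(
            f"{number_of_layers} layers do not divide evenly into {number_of_stage} stages"
        )
    per_stage = number_of_layers // number_of_stage
    # The stated limitation: the per-stage layer count must be an even number.
    if per_stage % 2 != 0:
        raise ValueError(
            f"number_of_layers / number_of_stage = {per_stage}, which is not even"
        )
    return per_stage
```

For example, 32 layers on 4 stages gives 8 layers per stage and passes, while 12 layers on 4 stages gives 3 and is rejected.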
Hi, I currently want to adapt zbv for Paddle. In your work, the main role of rollback is to reduce synchronization. However, the grad_norm in the opt stage requires all_reduce_sum,...
**Your question** How can I profile bubble time in pipeline parallelism?
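One rough way to frame the question, offered as a sketch and not as the repository's own profiling tooling: treat the bubble as the share of an iteration's wall-clock time during which a stage is not computing, and compare it with the textbook estimate for plain 1F1B. Both function names below are hypothetical. The measured busy time would have to come from your own timers, and the 1F1B formula assumes uniform forward and backward times on every stage and no communication cost.

```python
def bubble_fraction(busy_seconds: float, wall_seconds: float) -> float:
    """Fraction of one iteration that a stage spent idle.

    busy_seconds: time the stage spent in forward/backward compute (measured by you).
    wall_seconds: wall-clock duration of the whole iteration on that stage.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    if not 0 <= busy_seconds <= wall_seconds:
        raise ValueError("busy_seconds must lie between 0 and wall_seconds")
    return 1.0 - busy_seconds / wall_seconds


def ideal_1f1b_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Textbook bubble share for 1F1B: (p - 1) / (m + p - 1).

    Assumes identical per-microbatch compute time on every stage and
    zero communication overhead.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be at least 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)
```

Comparing the measured value with the idealized one gives a first indication of how much idle time comes from the schedule itself and how much from imbalance or communication.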
Hi, I really appreciate your work. I have a question about the zbh1 mode. This is one part of your code: ``` # For BWF pattern or in rank 0, we don't...
@ufotalent To implement a version using our own running engine and async IO
@QPHutu To implement a version by modifying the 1F1B schedule using sync IO