zero-bubble-pipeline-parallelism
[QUESTION] Measuring Pipeline Bubble Time During Megatron-LM Training
How did you measure the precise bubble time during a run in your experiments (T_Comm in the paper)? Megatron-LM's scheduling folds communication and idle waiting into the same NCCL operation, which makes the two hard to separate using timestamps or profilers.
I'm experimenting with vanilla Megatron-LM to identify bubbles in real time. However, the ncclDevKernel_SendRecv kernel seems to include both communication and idle time, and even with GPU sampling it is hard to tell when the GPU is truly idle and when communication is actually happening.
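
For context, here is a minimal sketch of the rough bound I can compute today. It is not the paper's method. It assumes I have already exported (start, end) timestamps for the compute kernels in one training step from a profiler trace. The function names and the interval format are my own placeholders. The first function returns the time in a step not covered by any compute kernel, which is communication plus idle time combined. The second gives the ideal 1F1B bubble fraction (p - 1) / (m + p - 1) for p stages and m microbatches, assuming uniform stage times and no communication cost, as a baseline to compare against.

```python
def non_compute_time(compute_intervals, step_start, step_end):
    """Time in [step_start, step_end] not covered by any compute kernel.

    compute_intervals: iterable of (start, end) pairs, same unit as the step bounds.
    The result is communication + idle combined; it does not separate the two.
    """
    # Clip every interval to the step window and drop those outside it.
    clipped = sorted(
        (max(s, step_start), min(e, step_end))
        for s, e in compute_intervals
        if e > step_start and s < step_end
    )
    busy = 0.0
    cur_s, cur_e = None, None
    for s, e in clipped:
        if cur_e is None or s > cur_e:
            # Gap before this interval: close the previous merged run.
            if cur_e is not None:
                busy += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            # Overlapping kernels (e.g. on different streams): extend the run.
            cur_e = max(cur_e, e)
    if cur_e is not None:
        busy += cur_e - cur_s
    return (step_end - step_start) - busy


def ideal_1f1b_bubble_fraction(num_stages, num_microbatches):
    """Ideal 1F1B bubble share of step time, assuming uniform stages and free communication."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # Hypothetical timestamps in milliseconds for one step.
    kernels = [(0, 2), (1, 3), (5, 6)]
    print(non_compute_time(kernels, 0, 10))   # 6.0
    print(ideal_1f1b_bubble_fraction(4, 8))   # 3/11, about 0.27
```

The gap between the measured non-compute share and the ideal fraction is only an upper bound on the extra cost. Splitting it into real communication and real idle time is exactly the part I can't work out how to do.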