[QUESTION] Measuring Pipeline Bubble Time During Megatron-LM Training

Open • HodBadichi opened this issue 8 months ago • 1 comment

I'm curious how you measured the precise bubble time during a run in your experiments (T_Comm in the paper). Megatron-LM's scheduling combines communication and idle time within the same NCCL operation, which makes them hard to tell apart using timestamps or profilers.

I'm experimenting with vanilla Megatron-LM to identify bubbles in real time. However, the ncclDevKernel_SendRecv kernel seems to include both communication and idle time, and even with GPU sampling it's hard to determine when the GPU is truly idle and when communication is actually happening.
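To make the ambiguity concrete, here is a minimal sketch of the accounting that kernel timestamps alone allow. The trace format (`(name, start_ms, end_ms)` tuples), the helper `split_step_time`, and the sample numbers are all hypothetical, not taken from Megatron-LM or any profiler's export format, and the sketch assumes kernels on a single stream that do not overlap.

```python
from typing import Dict, List, Tuple

# Hypothetical trace record: (kernel name, start_ms, end_ms).
Kernel = Tuple[str, float, float]

def split_step_time(kernels: List[Kernel], step_start: float, step_end: float) -> Dict[str, float]:
    """Split one training step into compute, SendRecv, and uncovered gap time.

    Assumes non-overlapping kernels on a single stream (an illustrative
    simplification, not how a real multi-stream trace looks).
    """
    compute = 0.0
    sendrecv = 0.0
    for name, start, end in kernels:
        if "SendRecv" in name:
            # This duration lumps together actual data transfer and time
            # spent waiting for the peer; timestamps cannot separate them.
            sendrecv += end - start
        else:
            compute += end - start
    gap = (step_end - step_start) - compute - sendrecv
    return {"compute": compute, "sendrecv": sendrecv, "gap": gap}

# Made-up sample trace for one step lasting from 0 ms to 18 ms.
trace = [
    ("fwd_gemm", 0.0, 4.0),
    ("ncclDevKernel_SendRecv", 4.0, 9.0),
    ("bwd_gemm", 9.5, 17.5),
]
breakdown = split_step_time(trace, 0.0, 18.0)
print(breakdown)
```

In this made-up trace the `sendrecv` bucket is only an upper bound on communication time: some unknown share of it is the stage waiting on its neighbor, which is exactly the part that would count as bubble.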


HodBadichi • Jun 14 '24 13:06