[REQUEST] Disable zero stage2 all_gather's bucketing

Open li-yi-dong opened this issue 2 years ago • 2 comments

It seems that ZeRO stage 2 can never overlap the all_gather with computation at the end of each step. Bucketing the all_gather only makes it slower and uses more GPU memory.

Apr 21 '23 09:04 li-yi-dong

@li-yi-dong, can you please share a bit more details of this issue? It would be helpful to share the model, scripts, and log snippets. Thanks.

Apr 21 '23 17:04 tjruwase

Hi @tjruwase, I'm training a GPT model with Megatron-DeepSpeed using ZeRO stage 2. Here is the timeline with the default all_gather bucketing strategy:

[Screenshot 2023-04-23 9.20.07 AM]

The default bucket size is 5e8, and the all_gather at the end of each step takes around 2.6 seconds.

I manually set the bucket size to 5e10, which is large enough to effectively disable bucketing:

[Screenshot 2023-04-23 9.26.54 AM]

The all_gather time drops to about 1.4 seconds.

Besides, the default bucket size of 5e8 consumes more GPU memory. By setting the bucket size to 1e10, I'm able to train with a larger batch size without hitting CUDA OOM.
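
For anyone who wants to try the same thing, here is a minimal sketch of the setting, assuming the bucket size is controlled through the `allgather_bucket_size` key under `zero_optimization` in the DeepSpeed config. The helper name `make_zero2_config` and the micro batch size are placeholders for illustration, not taken from my actual training script.

```python
import json


def make_zero2_config(allgather_bucket_size: float = 5e8) -> dict:
    """Build a minimal ZeRO stage 2 config dict (hypothetical helper).

    allgather_bucket_size is the number of elements gathered per bucket;
    5e8 is the default mentioned above.
    """
    return {
        # Placeholder value; use whatever your training run needs.
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": True,
            # A value larger than the total parameter count means
            # everything fits in one bucket, i.e. no bucketing in practice.
            "allgather_bucket_size": int(allgather_bucket_size),
        },
    }


default_cfg = make_zero2_config()         # default bucket size, 5e8
unbucketed_cfg = make_zero2_config(5e10)  # large enough to avoid bucketing

# The dict can be dumped to a JSON file and passed as the DeepSpeed config.
print(json.dumps(unbucketed_cfg, indent=2))
```

Whether a given value really disables bucketing depends on the model size, so the number should be chosen larger than the total parameter count.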

Apr 23 '23 01:04 li-yi-dong