ZeRO Stage 2 consumes more GPU memory than Stage 1
I was training a GPT-Neo (2.8B) model with the step1 script on 4 A10G GPUs. I used the default parameters from the example script, but zero_stage=2 consumes more GPU memory than zero_stage=1. Are there any solutions for this?
Hi @puyuanOT, can you please indicate how you are measuring memory usage? The output logs (training.log, located somewhere under ./output/) contain DeepSpeed output with current and max memory usage; these give the most accurate measurement of GPU memory usage.
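If you want to cross-check those numbers from inside the training loop, here is a minimal sketch using PyTorch's allocator statistics (the tag strings and call sites are only illustrative, not part of the example script). Note that nvidia-smi reports the caching allocator's reserved memory plus CUDA context overhead, so it usually overstates what the tensors actually use.

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Memory currently held by tensors vs. the peak since the last reset,
    # as tracked by PyTorch's CUDA caching allocator (in GiB).
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated: {allocated:.2f} GiB | peak: {peak:.2f} GiB")

# Example usage around the region you care about:
torch.cuda.reset_peak_memory_stats()
# ... forward pass, model_engine.backward(loss), optimizer step ...
log_gpu_memory("after step")
```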
@mrwyattii Thanks for the reply! I was using nvidia-smi to measure the memory cost. I was able to train pythia-2.8B with max_length=1280 using stage=1, but got an OOM error with stage=2.
Hello @puyuanOT, I confirmed that the peak memory usage of stage 2 can indeed be larger than that of stage 1 with the step 1 training script.
In stage 2, contiguous_gradients is set to true by default, and DeepSpeed allocates a contiguous gradient buffer at the beginning of the backward pass. This typically increases peak memory when the dominant memory consumer is activations rather than parameters or gradients (e.g. when gradient checkpointing is not enabled).
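For reference, the behavior above corresponds roughly to this part of the ZeRO config (the values shown are the usual DeepSpeed defaults; what the step1 script actually passes may differ by version):

```python
# Sketch of the relevant ZeRO settings for zero_stage=2 (assumed defaults).
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Gradients are copied into a pre-allocated contiguous buffer,
        # reserved at the start of the backward pass.
        "contiguous_gradients": True,
        # Number of elements reduced at a time; this also bounds the size
        # of the contiguous gradient buffer.
        "reduce_bucket_size": 5e8,
    },
}
```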
There are several options to reduce memory usage for this issue (a combined sketch of the settings follows the list):
- Set contiguous_gradients to false: Stage 2 won't allocate the extra buffer, but this may increase memory fragmentation overhead. (See the documentation for details.)
- Reduce reduce_bucket_size: A smaller value shrinks the buffer used by contiguous_gradients. (See the documentation for details.)
- Enable gradient checkpointing: Activation memory shrinks, so the impact of sharding gradients becomes significant and you should clearly see the advantage of stage 2.
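Concretely, the three options above could look like the sketch below. Treat it as a starting point only: the exact bucket size, the model id, and the gradient checkpointing call site are assumptions you would adapt to your own script (the step1 script may also expose its own flag for checkpointing).

```python
from transformers import AutoModelForCausalLM

# Hedged sketch of a ZeRO stage 2 config combining options 1 and 2.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Option 1: skip the pre-allocated contiguous gradient buffer
        # (saves its memory, at the cost of possible fragmentation).
        "contiguous_gradients": False,
        # Option 2: or keep contiguous_gradients true and just shrink the
        # buffer, e.g. 5e7 elements instead of the 5e8 default.
        "reduce_bucket_size": 5e7,
    },
}

# Option 3: enable gradient checkpointing on the model so that activations
# stop dominating peak memory. The model id is just for illustration,
# matching the pythia-2.8B mentioned above.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")
model.gradient_checkpointing_enable()
```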
I hope this helps you find a better training configuration for your environment.
Closing because we have no further information. Feel free to reopen if the problem still exists.