
[BUG]: How to run Llama-2 70B pretraining on 32 GPUs? I got OOM errors with almost every plugin and config.

Open yeegnauh opened this issue 1 year ago • 3 comments

🐛 Describe the bug

I have tried the gemini / gemini_auto / zero2 / hybrid_parallel plugins and still got OOM errors.

With the hybrid_parallel plugin, I tried the following configs (a rough sketch of how they map to plugin arguments follows the list):

  1. tp=8, pp=1, zero=2, microbatch_size=1, precision="fp16"
  2. tp=4, pp=2, zero=1, microbatch_size=1, etc.
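
Not from the original thread, but a minimal sketch of how config 1 above might be wired up, assuming a ColossalAI version where HybridParallelPlugin accepts tp_size, pp_size, zero_stage, microbatch_size and precision keyword arguments:

```python
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# The exact launch signature varies across ColossalAI versions.
colossalai.launch_from_torch(config={})

# tp=8, pp=1, zero=2, microbatch_size=1, precision="fp16" from config 1 above.
plugin = HybridParallelPlugin(
    tp_size=8,           # tensor parallel degree
    pp_size=1,           # pipeline parallel degree
    zero_stage=2,        # ZeRO stage for the data-parallel group
    microbatch_size=1,   # pipeline microbatch size (only relevant when pp_size > 1)
    precision="fp16",
)
booster = Booster(plugin=plugin)
# model, optimizer, criterion, dataloader, lr_scheduler = booster.boost(
#     model, optimizer, criterion, dataloader, lr_scheduler)
```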

Has anybody trained LLaMA 65B successfully?

Environment

torch 1.13.1 (cu117), Python 3.10

yeegnauh avatar Nov 30 '23 07:11 yeegnauh

Hi, what's your batch size on each GPU? The microbatch size is the unit that is passed through the pipeline stages when using pipeline parallelism.

If your batch size is more than 1, I recommend lowering it, since that can greatly reduce activation memory.
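
To make the relationship concrete, a rough sketch with hypothetical numbers of how the per-device batch splits into microbatches under pipeline parallelism:

```python
# Hypothetical numbers, just to illustrate the batch/microbatch relationship.
batch_size_per_device = 4      # what the dataloader yields per step on each data-parallel rank
microbatch_size = 1            # unit passed through each pipeline stage
num_microbatches = batch_size_per_device // microbatch_size  # -> 4

# Lowering batch_size_per_device (e.g. to 1) reduces the number of microbatches
# per step and, with it, the activation memory held during the step.
```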

Fridge003 avatar Dec 05 '23 02:12 Fridge003

Yes, I set batch_size=1 in the experiment. Do you have any recommendations for other configs?

yeegnauh avatar Dec 05 '23 03:12 yeegnauh

If the OOM error happens before the training loop, initializing the model under LazyInitContext might solve the problem (see examples/language/llama2/pretrain.py for usage).
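
For illustration, a minimal sketch of lazy initialization, assuming the transformers LlamaForCausalLM class and a ColossalAI version that provides colossalai.lazy.LazyInitContext (the model id is a placeholder):

```python
from colossalai.lazy import LazyInitContext
from transformers import LlamaConfig, LlamaForCausalLM

# Placeholder config; in practice load the Llama-2 70B config you are training.
config = LlamaConfig.from_pretrained("meta-llama/Llama-2-70b-hf")

with LazyInitContext(default_device="cuda"):
    # Parameters are created lazily, so the full 70B weights are not
    # materialized on a single device at this point.
    model = LlamaForCausalLM(config)

# The lazy parameters are materialized (and sharded) later, e.g. when the
# plugin wraps the model via booster.boost(model, ...).
```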

If the OOM happens during training, two optimizations come to mind:

  • You can set the offload_optim_frac argument of GeminiPlugin to a value between 0 and 1 (the smallest value that avoids OOM), or set the cpu_offload argument to True for LowLevelZeroPlugin or HybridParallelPlugin. These do similar things: they offload optimizer states to CPU memory to avoid OOM on the GPU.
  • You can set enable_flash_attention to True for GeminiPlugin and HybridParallelPlugin, since flash attention not only accelerates training but also saves GPU memory (a sketch of both settings follows this list).
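
Not taken verbatim from any example, but a hedged sketch of both suggestions, assuming the keyword names offload_optim_frac, cpu_offload and enable_flash_attention match your ColossalAI version:

```python
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin

# Option A: Gemini with partial optimizer-state offload and flash attention.
gemini_plugin = GeminiPlugin(
    offload_optim_frac=0.5,        # use the smallest value that avoids OOM
    enable_flash_attention=True,
    precision="fp16",
)

# Option B: hybrid parallelism with optimizer-state offload and flash attention.
hybrid_plugin = HybridParallelPlugin(
    tp_size=8,
    pp_size=1,
    zero_stage=2,
    cpu_offload=True,              # offload optimizer states to CPU memory
    microbatch_size=1,
    precision="fp16",
    enable_flash_attention=True,
)
```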

Fridge003 avatar Dec 05 '23 03:12 Fridge003