ColossalAI
[BUG]: How to run llama2 70B pretrain on 32 GPUs? I got OOM errors with almost every plugin and config.
🐛 Describe the bug
I have tried the gemini / gemini_auto / zero2 / hybrid_parallel plugins and still got OOM errors.
With the hybrid_parallel plugin, I tried configs such as the following (see the sketch after this list):
- tp=8, pp=1, zero=2, microbatch_size=1, precision="fp16"
- tp=4, pp=2, zero=1, microbatch_size=1, etc.
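For reference, a minimal sketch of how the first config above could be expressed with `HybridParallelPlugin`; the argument names follow the ColossalAI booster API and may differ across versions, so treat this as an illustration rather than the exact script used.

```python
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch(config={})  # one process per GPU, e.g. launched via torchrun

plugin = HybridParallelPlugin(
    tp_size=8,          # tensor parallel degree
    pp_size=1,          # pipeline parallel degree
    zero_stage=2,       # ZeRO stage inside each data-parallel group
    microbatch_size=1,  # pipeline micro-batch size
    precision="fp16",
)
booster = Booster(plugin=plugin)
# model, optimizer, criterion and dataloader are then wrapped before training:
# model, optimizer, criterion, dataloader, _ = booster.boost(
#     model, optimizer, criterion=criterion, dataloader=dataloader
# )
```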
Is there anybody who has trained llama 65B successfully?
Environment
torch 1.13.1 + cu117, Python 3.10
Hi, what's your batch size on each GPU? The microbatch size is the unit passed through the pipeline when pipeline parallelism is used.
If your batch size is more than 1, I recommend lowering it, since the activation memory can be greatly reduced.
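To make the relationship concrete, here is a small sketch (assuming `HybridParallelPlugin` and its `microbatch_size` argument) of how the per-GPU dataloader batch size and the pipeline micro-batch size fit together:

```python
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=4,
    pp_size=2,
    zero_stage=1,
    microbatch_size=1,  # each pipeline step processes 1 sample at a time
    precision="fp16",
)

# The dataloader batch size is split into chunks of `microbatch_size` by the
# pipeline schedule. Keeping the per-GPU batch size small (e.g. 1) keeps the
# activation memory low.
# dataloader = plugin.prepare_dataloader(dataset, batch_size=1, shuffle=True)
```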
Yes, I set batch_size=1 in the experiment. Do you have any recommendations for other configs?
If the OOM error happens before the training loop, initializing the model under LazyInitContext might solve the problem (see examples/language/llama2/pretrain.py for usage).
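A sketch of lazy initialization along the lines of examples/language/llama2/pretrain.py; exact imports and context arguments may vary by version, and the config path is a placeholder.

```python
from contextlib import nullcontext

from transformers import LlamaConfig, LlamaForCausalLM
from colossalai.lazy import LazyInitContext

use_lazy_init = True  # avoids materializing the full 70B model on one device
init_ctx = LazyInitContext() if use_lazy_init else nullcontext()

with init_ctx:
    # Parameters are created as lazy (meta) tensors and only materialized,
    # already sharded, when booster.boost() wraps the model.
    config = LlamaConfig.from_pretrained("path/to/llama2-70b")  # hypothetical path
    model = LlamaForCausalLM(config)
```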
If the OOM happens during training, two optimizations come to mind (sketched below):
- You can set the `offload_optim_frac` argument to a value between 0 and 1 (the smallest value that avoids OOM) for `GeminiPlugin`, or set the `cpu_offload` argument to True for `LowLevelZeroPlugin` or `HybridParallelPlugin`. They work similarly: optimizer states are offloaded to CPU memory to avoid OOM on the GPU.
- You can set `enable_flash_attention` to True for `GeminiPlugin` and `HybridParallelPlugin`, since flash attention not only accelerates training but also saves GPU memory.
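A sketch of the two memory-saving knobs described above; the argument names are those exposed by the ColossalAI plugins, but please double-check them against the version you have installed.

```python
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin

# Option A: Gemini with partial optimizer-state offload and flash attention.
gemini_plugin = GeminiPlugin(
    offload_optim_frac=0.5,       # fraction of optimizer states kept in CPU memory;
                                  # use the smallest value that avoids OOM
    enable_flash_attention=True,  # saves activation memory and speeds up attention
    precision="fp16",
)

# Option B: hybrid parallel with ZeRO CPU offload and flash attention.
hybrid_plugin = HybridParallelPlugin(
    tp_size=8,
    pp_size=1,
    zero_stage=2,
    cpu_offload=True,             # offload optimizer states to CPU memory
    enable_flash_attention=True,
    microbatch_size=1,
    precision="fp16",
)
```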