[BUG]: Failed to train the OPT-13b model due to OOM
🐛 Describe the bug
I am running the example in "examples/tutorial/opt/opt/run_clm.sh" to train an OPT-13b model, but it fails due to OOM. Could you provide a reference for how large a model can be trained with ColossalAI?
Parameters: BS=1 MEMCAP=0 GPUNUM=8
Error info:
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 31.75 GiB total capacity; 24.62 GiB already allocated; 1.06 GiB free; 27.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    ret = input.softmax(dim, dtype=dtype)
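For context, a rough back-of-envelope estimate (not from the issue, just the common 16-bytes-per-parameter rule for mixed-precision Adam: fp16 weights + fp16 gradients + fp32 master weights + two fp32 Adam moments) suggests why 13B parameters is tight on 8 × 32 GiB V100s even with full sharding:

```python
# Hypothetical memory estimate for OPT-13b with mixed-precision Adam.
# Assumed breakdown per parameter: 2 (fp16 weights) + 2 (fp16 grads)
# + 4 (fp32 master weights) + 8 (fp32 Adam moments) = 16 bytes.
PARAMS = 13e9
BYTES_PER_PARAM = 2 + 2 + 4 + 8

total_gib = PARAMS * BYTES_PER_PARAM / 2**30
per_gpu_gib = total_gib / 8  # assuming ZeRO-style even sharding over 8 GPUs

print(f"total model states: {total_gib:.1f} GiB")   # ~193.7 GiB
print(f"per GPU (8-way shard): {per_gpu_gib:.1f} GiB of ~32 GiB")  # ~24.2 GiB
```

This leaves only a few GiB per GPU for activations and allocator overhead, which matches the "24.62 GiB already allocated" figure in the traceback; CPU/NVMe offloading or more GPUs would be needed for headroom.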
Environment
8 * V100 GPU or 16 * V100 GPU
Hi @hibayesian, this path is an abbreviated tutorial prepared for specific events and may not be maintained in real time. Please use the OPT example here instead: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/opt Thanks.
We have made many updates since then; please check the latest code. This issue was closed due to inactivity. Thanks.