[BUG]: Failed to train the OPT-13b model due to OOM
🐛 Describe the bug
I am running the example in "examples/tutorial/opt/opt/run_clm.sh" to train an OPT-13b model, but it fails due to OOM. Could you provide a reference for how large a model can be trained with ColossalAI?
Parameters: BS=1 MEMCAP=0 GPUNUM=8
Error info:
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 31.75 GiB total capacity; 24.62 GiB already allocated; 1.06 GiB free; 27.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    ret = input.softmax(dim, dtype=dtype)
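For context, a rough back-of-envelope estimate (not from the issue, just the common 16-bytes-per-parameter rule for mixed-precision Adam: fp16 weights + fp16 gradients + fp32 master weights + two fp32 Adam moments) suggests why 13B parameters is tight on 8 × 32 GiB V100s even with full sharding:

```python
# Hypothetical memory estimate for OPT-13b with mixed-precision Adam.
# Assumed breakdown per parameter: 2 (fp16 weights) + 2 (fp16 grads)
# + 4 (fp32 master weights) + 8 (fp32 Adam moments) = 16 bytes.
PARAMS = 13e9
BYTES_PER_PARAM = 2 + 2 + 4 + 8

total_gib = PARAMS * BYTES_PER_PARAM / 2**30
per_gpu_gib = total_gib / 8  # assuming ZeRO-style even sharding over 8 GPUs

print(f"total model states: {total_gib:.1f} GiB")   # ~193.7 GiB
print(f"per GPU (8-way shard): {per_gpu_gib:.1f} GiB of ~32 GiB")  # ~24.2 GiB
```

This leaves only a few GiB per GPU for activations and allocator overhead, which matches the "24.62 GiB already allocated" figure in the traceback; CPU/NVMe offloading or more GPUs would be needed for headroom.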
Environment
8 * V100 GPU or 16 * V100 GPU
Hi @hibayesian, this path is an abbreviated tutorial prepared for specific events and may not be maintained in real time. Please use the OPT example here instead: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/opt Thanks.
We have made many updates since then; please check the latest code. This issue was closed due to inactivity. Thanks.