[BUG]: train_sft.py LLAMA-7B colossalai_zero2 OOM

Open TyrionZK opened this issue 1 year ago • 4 comments

🐛 Describe the bug

I ran train_sft.py to fine-tune LLaMA-7B with the colossalai_zero2 strategy, batch_size=1 and max_len=512, but it ran out of memory (OOM).

Theoretically, the memory usage of a single GPU should be about (2+(2+12)/4)*7=38.5G, plus the memory needed for the model input.
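The estimate above can be reproduced with a short calculation. This is only a rough sketch: it assumes the usual reading of the formula (2 bytes per parameter for half-precision weights kept on every GPU, plus 2 bytes for gradients and 12 bytes for optimizer states sharded across the 4 GPUs), and the function name and its parameters are made up for illustration. It leaves out activations, temporary buffers, and allocator fragmentation, so real usage will be higher.

```python
def zero2_per_gpu_gb(params_billion, num_gpus,
                     weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough per-GPU memory (GB) for model states under ZeRO-2-style sharding.

    Assumption: weights are replicated on every GPU, while gradients and
    optimizer states are split evenly across GPUs. Activations and buffers
    are NOT included.
    """
    bytes_per_param = weight_bytes + (grad_bytes + optim_bytes) / num_gpus
    # 1 billion parameters * 1 byte = 1 GB (decimal), so this is already in GB.
    return bytes_per_param * params_billion


# LLaMA-7B on 4 GPUs, matching (2 + (2 + 12) / 4) * 7
estimate = zero2_per_gpu_gb(params_billion=7, num_gpus=4)
print(f"{estimate:.1f} GB")  # 38.5 GB
```

With these assumptions the model states alone should fit comfortably in 80G, which is why the OOM is unexpected.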

I don't know why this happens. It is very different from the official experiment.

Environment

My device: 4 × A800 80G

Oct 12 '23 09:10 TyrionZK

Bot detected that the issue body's language is not English and translated it automatically.

Title: [BUG]: train_sft.py LLAMA-7B colossalai_zero2 OOM

Oct 12 '23 09:10 Issues-translate-bot

Thank you for your feedback 😃 Could you tell me the steps to reproduce it, such as the scripts you ran and how you configured the environment? Then I can try to reproduce the bug on our A100 machine and help with it.

Oct 13 '23 03:10 Orion-Zheng

I just ran the official code: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/train_sft.py

Oct 13 '23 08:10 TyrionZK

Thank you for your concern. We will address this issue as soon as possible.

Oct 24 '23 03:10 flybird11111
