ColossalAI
Cannot train llama-7b due to OOM on 40G A100
GPU: 40G A100 x 8
I want to train the 7B LLaMA model on 40G A100 GPUs, but it reports that there is not enough GPU memory. The training command is:
torchrun --standalone --nproc_per_node=4 examples/train_sft.py --pretrain "**********7B/llama-7b" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset *********/data/merged_file.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4
40G should be enough for LLaMA's 7B model. When I limit the dataset size like this:
--max_datasets_size 4096
training completes, but GPU memory usage is different at the beginning stage and near the end.
[Screenshot: GPU memory usage at the beginning stage]
[Screenshot: GPU memory usage near the end]
Another question: the script asks me to confirm the W&B (wandb) choice multiple times. Can this process be simplified?
Add WANDB_MODE=disabled before torchrun.
Thanks! Do you have any suggestions for OOM issues?
same OOM question
same question
This might be because zero2 does not save enough memory on a 40 GB card; we used 80 GB A100s. To lower memory consumption further, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing them and expect it to work next week. We will update the bug-fix status here.
2023/04/17: we have provided a low-resource example, see https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq. Thanks.
same question
same OOM question
same OOM question
Setting placement_policy='cpu' can alleviate this issue.
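For reference, a minimal sketch of where that setting would go; the import path and constructor arguments below are assumptions based on the strategy names used in this thread, not a confirmed API, so check your local train_sft.py:

```python
# Hypothetical sketch: switch the strategy's placement policy to 'cpu' so that
# parameters and optimizer states live in host memory and are moved to the GPU
# on demand, trading speed for a much smaller GPU footprint.
# The class name and arguments are assumptions; adapt them to your Coati version.
from coati.trainer.strategies import ColossalAIStrategy  # assumed import path

strategy = ColossalAIStrategy(
    stage=2,                 # ZeRO stage used by the 'colossalai_zero2' preset (assumed)
    placement_policy='cpu',  # instead of the default 'cuda'
)
```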
same oom
same OOM question
If I set placement_policy='cpu', how long will it take to run train_sft.py?
Same issue here
A6000x2 with 48GB VRAM, got OOM too.
8*V100 32G got the same OOM
Please help us!
We ran our LLaMA-7B on 4 * A100 80G. If you want to run it on a 40G A100, you can use a smaller batch size and increase accimulation_steps to keep the total batch size the same.
If you only have GPUs with less memory, you can use Gemini with CPU offload or LoRA.
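To make the "same total batch size" advice concrete, a tiny sketch of the arithmetic (the 4-GPU count and the 80G-card settings are assumptions for illustration):

```python
# Effective (global) batch size = per-GPU batch size * gradient accumulation steps * number of GPUs.
# Shrinking the per-GPU batch while growing accumulation keeps it unchanged.
def effective_batch(per_gpu_batch: int, accumulation_steps: int, num_gpus: int = 4) -> int:
    return per_gpu_batch * accumulation_steps * num_gpus

print(effective_batch(4, 4))   # e.g. 80G cards: --batch_size 4, --accimulation_steps 4  -> 64
print(effective_batch(1, 16))  # 40G cards:      --batch_size 1, --accimulation_steps 16 -> 64
```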
@Fazziekey @FrankLeeeee Same OOM issue. My setup: A100 40GB, 1 GPU, running the llama-7b model with batch_size=1, max_seq_len=512, the colossalai_zero2 strategy, and placement_policy='cuda'. Using torch.cuda.memory_allocated() to analyze memory usage, after SFTTrainer runs self.optimizer = strategy.setup_optimizer(optim, self.model), 38590.52 MB of CUDA memory is already occupied, and the remaining CUDA memory is clearly not enough for the data. Is this normal? In addition, with the colossalai_gemini strategy, CUDA memory blows up directly at the self.optimizer = strategy.setup_optimizer(optim, self.model) step, which feels very strange. Can you give me a solution?
Train script:
CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain xxx/llama-7b \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path exp/Coati-7B \
    --dataset data/instinwild_en.json \
    --batch_size 1 \
    --accimulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1
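For anyone reproducing the measurement above, a minimal sketch of the probe (plain PyTorch; the setup_optimizer line is shown as a comment because it needs the Coati trainer objects):

```python
# Log allocated CUDA memory around the optimizer setup step described above.
import torch

def log_allocated(tag: str) -> None:
    # memory_allocated() reports memory held by tensors PyTorch currently keeps on the GPU
    mb = torch.cuda.memory_allocated() / 1024 ** 2
    print(f"[{tag}] allocated: {mb:.2f} MB")

log_allocated("before setup_optimizer")
# self.optimizer = strategy.setup_optimizer(optim, self.model)  # inside SFTTrainer
log_allocated("after setup_optimizer")
```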
We are trying to train the 30B llama model on 8x8 80G A100 cards and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for the larger models (30B/65B). Thanks!
For a larger task like this, you can contact our commercial team or me directly, which may help you solve the problem faster.
A100 40GB: setting batch_size=1 still OOMs.
Thanks! I've contacted you by email about the larger training task and am waiting for your reply.
same OOM question
same OOM problem on 32G V100 * 8
Same problem! Can the official team fix this bug ASAP?
Hi guys, we will double-check it this week and provide a detailed example for limited resources. Thanks.
It creates FP32 master weights when initializing the trainer, which is why so much memory is already occupied after strategy.setup_optimizer. However, the FP32 master weights will be sharded if you use multiple GPUs.
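A rough back-of-the-envelope check of those numbers (assumptions: ~7e9 parameters, fp16 model weights with an fp32 master copy, Adam moments in fp32, and that the moments are allocated lazily after this point):

```python
# Rough memory estimate for llama-7b mixed-precision training (see assumptions above).
P = 7e9            # approximate parameter count
GiB = 1024 ** 3

fp16_weights = 2 * P / GiB      # ~13.0 GiB  model weights in fp16
fp32_master  = 4 * P / GiB      # ~26.1 GiB  fp32 master copy created when the trainer is set up
adam_moments = 2 * 4 * P / GiB  # ~52.2 GiB  Adam m and v in fp32

print(f"fp16 weights : {fp16_weights:5.1f} GiB")
print(f"fp32 master  : {fp32_master:5.1f} GiB")
print(f"adam moments : {adam_moments:5.1f} GiB")

# On a single GPU nothing is sharded, so weights + master copy alone are ~39 GiB,
# in the same ballpark as the ~38.6 GB reported above -- already filling a 40 GB card.
# With ZeRO-2 across N GPUs the optimizer states (master weights + moments) are split N ways:
n = 4
print(f"optimizer states per GPU with {n}-way ZeRO-2: {(fp32_master + adam_moments) / n:5.1f} GiB")
```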
I've added a "colossalai_zero2_cpu" strategy for this script. I tested it on 4x 40G A100 and it works.
If you use the 'colossalai_zero2_cpu' strategy, how much slower is it than the previous 'colossalai_zero2'? @ver217
@binmakeswell Is the 'colossalai_zero2_cpu' strategy a complete solution for low-resource machines? Will there be other solutions?