
Cannot train llama-7b due to OOM on 40G A100

Open lurenlym opened this issue 1 year ago · 48 comments

GPU: 40G A100 * 8

I want to train the 7B LLaMA model on 40G A100, but it reports that there is not enough GPU memory. The training command is:

torchrun --standalone --nproc_per_node=4 examples/train_sft.py --pretrain "**********7B/llama-7b" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset *********/data/merged_file.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4

40G should be enough for LLaMA's 7B model. When I limit the dataset size with --max_datasets_size 4096, training completes, but GPU memory usage differs between the beginning stage and near the end.

(Screenshots: GPU memory usage at the beginning stage and at the ending stage.)

Another question: I found that the W&B choice has to be confirmed multiple times. Can this process be simplified?


lurenlym avatar Mar 30 '23 03:03 lurenlym

Add WANDB_MODE=disabled before torchrun
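For example, a sketch that reuses the command from the original post (the pretrain/dataset paths are placeholders):

    # Disable Weights & Biases logging so the run never stops to ask for a W&B mode
    WANDB_MODE=disabled torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
        --pretrain /path/to/llama-7b --model 'llama' --strategy colossalai_zero2 \
        --log_interval 10 --save_path output/Coati-7B --dataset /path/to/merged_file.json \
        --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4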

qijiaxing avatar Mar 30 '23 10:03 qijiaxing

Add WANDB_MODE=disabled before torchrun

Thanks! Do you have any suggestions for OOM issues?

lurenlym avatar Mar 31 '23 01:03 lurenlym

same OOM question

okzhili avatar Mar 31 '23 01:03 okzhili


This might be because zero2 is not enough to save memory on a 40 GB card; we used 80 GB A100s. To enable lower memory consumption, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing it and expect it to work next week. We will update the bug-fix status here.

Update 23/04/17: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq — we have provided a low-resource example. Thanks.

FrankLeeeee avatar Mar 31 '23 03:03 FrankLeeeee

same question

MrRace avatar Apr 01 '23 12:04 MrRace

same OOM question

tiandongtao avatar Apr 03 '23 01:04 tiandongtao

same OOM question

Setting placement_policy='cpu' can alleviate this issue.
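If your checkout of examples/train_sft.py hard-codes the placement policy where the strategy is constructed (the commands in this thread have no flag for it), the change has to be made in the script itself, e.g. (a sketch; adjust the path to your checkout):

    # Switch the ZeRO-2 strategy from GPU placement to CPU placement, which offloads
    # optimizer state to host memory: slower per step, but much lower GPU memory use.
    sed -i "s/placement_policy='cuda'/placement_policy='cpu'/" examples/train_sft.py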

okzhili avatar Apr 03 '23 01:04 okzhili

same oom

yhaiqiang avatar Apr 03 '23 04:04 yhaiqiang

same OOM question

Setting placement_policy='cpu' can alleviate this issue.

How long will it take to run train_sft.py if only the CPU is used?

MrRace avatar Apr 03 '23 05:04 MrRace

Same issue here

alibabadoufu avatar Apr 03 '23 13:04 alibabadoufu

A6000x2 with 48GB VRAM, got OOM too.

leonselina avatar Apr 04 '23 07:04 leonselina

8*V100 32G got the same OOM

balcklive avatar Apr 05 '23 04:04 balcklive

Heal our children!

MrRace avatar Apr 06 '23 02:04 MrRace

We ran our LLaMA 7B on 4 * A100 80G. If you want to run it on a 40G A100, you can use a smaller batch size and increase accimulation_steps to keep the total batch size the same.
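For reference, the effective batch size is nproc_per_node * batch_size * accimulation_steps, so you can trade one for the other. A sketch (the 80G numbers and paths are only for illustration):

    # e.g. 4 GPUs * batch_size 4 * accimulation_steps 4  = effective batch 64 (80G setup)
    #      4 GPUs * batch_size 1 * accimulation_steps 16 = effective batch 64, with far less
    #      activation memory held per forward/backward pass
    torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
        --pretrain /path/to/llama-7b --model 'llama' --strategy colossalai_zero2 \
        --dataset /path/to/merged_file.json --lr 2e-5 --max_epochs 1 \
        --batch_size 1 --accimulation_steps 16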

Fazziekey avatar Apr 06 '23 06:04 Fazziekey

If you only have GPUs with less memory, you can use Gemini with CPU offload, or LoRA.
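A sketch of both options on top of the original command (Gemini still had open bugs at this point in the thread, so treat it as experimental; paths and the lora_rank value are only for illustration):

    # Option 1: Gemini strategy; combine with placement_policy='cpu' in the script
    # for CPU offload, as noted earlier in the thread.
    torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
        --pretrain /path/to/llama-7b --model 'llama' --strategy colossalai_gemini \
        --dataset /path/to/data.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1

    # Option 2: LoRA - train low-rank adapters instead of updating all 7B weights.
    torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
        --pretrain /path/to/llama-7b --model 'llama' --strategy colossalai_zero2 \
        --dataset /path/to/data.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 \
        --max_epochs 1 --lora_rank 8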

Fazziekey avatar Apr 06 '23 06:04 Fazziekey

@Fazziekey @FrankLeeeee Same OOM issue. The setup is likewise an A100 40GB, 1 GPU running the llama-7B model, batch=1, max_seq_len=512, colossalai_zero2 with placement_policy='cuda'. Using torch.cuda.memory_allocated() to analyze memory usage: in SFTTrainer, after self.optimizer = strategy.setup_optimizer(optim, self.model) runs, 38590.52 MB of CUDA memory is already occupied, and the remaining CUDA memory is obviously not enough to run the data. Is this normal? In addition, with the colossalai_gemini strategy, CUDA memory is exhausted directly at the self.optimizer = strategy.setup_optimizer(optim, self.model) step, which feels very strange. Can you give me a solution?

Train script:

CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain xxx/llama-7b \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path exp/Coati-7B \
    --dataset data/instinwild_en.json \
    --batch_size 1 \
    --accimulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1

fan-niu avatar Apr 06 '23 07:04 fan-niu

This might be because zero2 is not enough to save memory on a 40 GB card; we used 80 GB A100s. To enable lower memory consumption, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing it and expect it to work next week. We will update the bug-fix status here.

We are trying to train a 30B llama model on 8x8 80G A100 cards, and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for the larger models (30B/65B), thanks~

penghaozhou avatar Apr 06 '23 08:04 penghaozhou

This might be because zero2 is not enough to save memory on a 40 GB card; we used 80 GB A100s. To enable lower memory consumption, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing it and expect it to work next week. We will update the bug-fix status here.

We are trying to train a 30B llama model on 8x8 80G A100 cards, and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for the larger models (30B/65B), thanks~

For a larger task like this, you can contact our commercial team or me directly, which may help you solve this problem faster.

Fazziekey avatar Apr 06 '23 08:04 Fazziekey

We ran our LLaMA 7B on 4 * A100 80G. If you want to run it on a 40G A100, you can use a smaller batch size and increase accimulation_steps to keep the total batch size the same.

A100 40GB: setting batch_size=1 also OOMs~

MrRace avatar Apr 06 '23 09:04 MrRace

This might be because zero2 is not enough to save memory on a 40 GB card; we used 80 GB A100s. To enable lower memory consumption, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing it and expect it to work next week. We will update the bug-fix status here.

We are trying to train a 30B llama model on 8x8 80G A100 cards, and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for the larger models (30B/65B), thanks~

For a larger task like this, you can contact our commercial team or me directly, which may help you solve this problem faster.

Thanks~ I contacted you by email; waiting for your reply.

penghaozhou avatar Apr 07 '23 02:04 penghaozhou

same OOM question

RoeeXu avatar Apr 07 '23 22:04 RoeeXu

same OOM problem on 32G V100 * 8

mcc311 avatar Apr 09 '23 04:04 mcc311


same problem!!! Can the official team solve this bug ASAP!!!!

janglichao avatar Apr 09 '23 14:04 janglichao

Hi guys, we will double-check it this week and give a detailed example for limited resources. Thanks.

binmakeswell avatar Apr 10 '23 03:04 binmakeswell

@Fazziekey @FrankLeeeee Same OOM issue. The setup is likewise an A100 40GB, 1 GPU running the llama-7B model, batch=1, max_seq_len=512, colossalai_zero2 with placement_policy='cuda'. Using torch.cuda.memory_allocated() to analyze memory usage: in SFTTrainer, after self.optimizer = strategy.setup_optimizer(optim, self.model) runs, 38590.52 MB of CUDA memory is already occupied, and the remaining CUDA memory is obviously not enough to run the data. Is this normal? In addition, with the colossalai_gemini strategy, CUDA memory is exhausted directly at the self.optimizer = strategy.setup_optimizer(optim, self.model) step, which feels very strange. Can you give me a solution?

Train script:

CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 train_sft.py --pretrain xxx/llama-7b --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path exp/Coati-7B --dataset data/instinwild_en.json --batch_size 1 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 --max_epochs 1

It creates the FP32 master weights when the trainer is initialized. However, the FP32 master weights will be sharded across devices if you use multiple GPUs.
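A rough back-of-the-envelope that lines up with the number reported above (assuming roughly 6.7B parameters for llama-7b and no sharding on a single GPU): the FP16 model weights take about 2 bytes/param ≈ 13.4 GB, and the FP32 master copy another 4 bytes/param ≈ 26.8 GB, i.e. about 40 GB right after setup_optimizer — before gradients, Adam's momentum/variance buffers, or activations are allocated. That is why a single 40 GB card cannot hold the unsharded zero2 state, while sharding the master weights and optimizer state across several GPUs, or offloading them to CPU, makes it fit.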

ver217 avatar Apr 10 '23 10:04 ver217

I've added a "colossalai_zero2_cpu" strategy for this script. I tested it on 4x 40G A100 and it works.
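The invocation is the same as before with only the strategy flag changed, e.g. (a sketch based on the command from the original post; paths are placeholders):

    # 'colossalai_zero2_cpu': ZeRO-2 with CPU offload (per its name), tested on 4x 40G A100.
    torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
        --pretrain /path/to/llama-7b --model 'llama' --strategy colossalai_zero2_cpu \
        --log_interval 10 --save_path output/Coati-7B --dataset /path/to/merged_file.json \
        --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1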

ver217 avatar Apr 11 '23 02:04 ver217

If you use this 'colossalai_zero2_cpu' strategy, how much slower is it than the previous 'colossalai_zero2'? @ver217

tianbuwei avatar Apr 11 '23 03:04 tianbuwei

@binmakeswell Is the 'colossalai_zero2_cpu' strategy a complete solution for low-resource machines? Will there be other solutions?

tianbuwei avatar Apr 11 '23 03:04 tianbuwei