ColossalAI
Cannot train llama-7b due to OOM on 40G A100
GPU: 40G A100 x 8
I want to train the 7B LLaMA model on 40G A100 GPUs, but it reports that there is not enough GPU memory. The training command is:
torchrun --standalone --nproc_per_node=4 examples/train_sft.py --pretrain "**********7B/llama-7b" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset *********/data/merged_file.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4
40G should be enough for LLaMA's 7B model. When I limit the dataset size like this:
--max_datasets_size 4096
training completes, but GPU memory usage is different at the beginning stage and near the end.
[Screenshot: GPU memory usage at the beginning stage]
[Screenshot: GPU memory usage near the end]
Another question: the script asks me to confirm the W&B (wandb) choice multiple times. Can this process be simplified?
Add WANDB_MODE=disabled before torchrun.
Thanks! Do you have any suggestions for OOM issues?
same OOM question
same question
This might be because zero2 does not save enough memory on a 40 GB card; we used 80 GB A100s. To lower memory consumption further, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing them and expect it to work next week. We will update the bug-fix status here.
2023/04/17: we have provided a low-resource example, see https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq. Thanks.
same question
same OOM question
same OOM question
Setting placement_policy='cpu' can alleviate this issue.
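For reference, a minimal sketch of where that setting would go; the import path and constructor arguments below are assumptions based on the strategy names used in this thread, not a confirmed API, so check your local train_sft.py:

```python
# Hypothetical sketch: switch the strategy's placement policy to 'cpu' so that
# parameters and optimizer states live in host memory and are moved to the GPU
# on demand, trading speed for a much smaller GPU footprint.
# The class name and arguments are assumptions; adapt them to your Coati version.
from coati.trainer.strategies import ColossalAIStrategy  # assumed import path

strategy = ColossalAIStrategy(
    stage=2,                 # ZeRO stage used by the 'colossalai_zero2' preset (assumed)
    placement_policy='cpu',  # instead of the default 'cuda'
)
```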
same oom
same OOM question
If I set placement_policy='cpu', how long will it take to run train_sft.py?
Same issue here
A6000x2 with 48GB VRAM, got OOM too.
8*V100 32G got the same OOM
Please help us!
We ran our LLaMA-7B on 4 * A100 80G. If you want to run it on a 40G A100, you can use a smaller batch size and increase accimulation_steps to keep the total batch size the same.
If you only have GPUs with less memory, you can use Gemini with CPU offload or LoRA.
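To make the "same total batch size" advice concrete, a tiny sketch of the arithmetic (the 4-GPU count and the 80G-card settings are assumptions for illustration):

```python
# Effective (global) batch size = per-GPU batch size * gradient accumulation steps * number of GPUs.
# Shrinking the per-GPU batch while growing accumulation keeps it unchanged.
def effective_batch(per_gpu_batch: int, accumulation_steps: int, num_gpus: int = 4) -> int:
    return per_gpu_batch * accumulation_steps * num_gpus

print(effective_batch(4, 4))   # e.g. 80G cards: --batch_size 4, --accimulation_steps 4  -> 64
print(effective_batch(1, 16))  # 40G cards:      --batch_size 1, --accimulation_steps 16 -> 64
```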
@Fazziekey @FrankLeeeee Same OOM issue. My setup: A100 40GB, 1 GPU, running the llama-7b model with batch_size=1, max_seq_len=512, the colossalai_zero2 strategy, and placement_policy='cuda'. Using torch.cuda.memory_allocated() to analyze memory usage, after SFTTrainer runs self.optimizer = strategy.setup_optimizer(optim, self.model), 38590.52 MB of CUDA memory is already occupied, and the remaining CUDA memory is clearly not enough for the data. Is this normal? In addition, with the colossalai_gemini strategy, CUDA memory blows up directly at the self.optimizer = strategy.setup_optimizer(optim, self.model) step, which feels very strange. Can you give me a solution?
Train script:
CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain xxx/llama-7b \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path exp/Coati-7B \
    --dataset data/instinwild_en.json \
    --batch_size 1 \
    --accimulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1
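For anyone reproducing the measurement above, a minimal sketch of the probe (plain PyTorch; the setup_optimizer line is shown as a comment because it needs the Coati trainer objects):

```python
# Log allocated CUDA memory around the optimizer setup step described above.
import torch

def log_allocated(tag: str) -> None:
    # memory_allocated() reports memory held by tensors PyTorch currently keeps on the GPU
    mb = torch.cuda.memory_allocated() / 1024 ** 2
    print(f"[{tag}] allocated: {mb:.2f} MB")

log_allocated("before setup_optimizer")
# self.optimizer = strategy.setup_optimizer(optim, self.model)  # inside SFTTrainer
log_allocated("after setup_optimizer")
```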
We are trying to train the 30B llama model on 8x8 80G A100 cards and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for the larger models (30B/65B). Thanks!
For a larger task like this, you can contact our commercial team or me directly, which may help you solve the problem faster.
A100 40GB: setting batch_size=1 still OOMs.
Thanks! I've contacted you by email about the larger training task and am waiting for your reply.
same OOM question
same OOM problem on 32G V100 * 8
Same problem! Can the official team fix this bug ASAP?
Hi guys, we will double-check it this week and provide a detailed example for limited resources. Thanks.
It creates FP32 master weights when initializing the trainer, which is why so much memory is already occupied after strategy.setup_optimizer. However, the FP32 master weights will be sharded if you use multiple GPUs.
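A rough back-of-the-envelope check of those numbers (assumptions: ~7e9 parameters, fp16 model weights with an fp32 master copy, Adam moments in fp32, and that the moments are allocated lazily after this point):

```python
# Rough memory estimate for llama-7b mixed-precision training (see assumptions above).
P = 7e9            # approximate parameter count
GiB = 1024 ** 3

fp16_weights = 2 * P / GiB      # ~13.0 GiB  model weights in fp16
fp32_master  = 4 * P / GiB      # ~26.1 GiB  fp32 master copy created when the trainer is set up
adam_moments = 2 * 4 * P / GiB  # ~52.2 GiB  Adam m and v in fp32

print(f"fp16 weights : {fp16_weights:5.1f} GiB")
print(f"fp32 master  : {fp32_master:5.1f} GiB")
print(f"adam moments : {adam_moments:5.1f} GiB")

# On a single GPU nothing is sharded, so weights + master copy alone are ~39 GiB,
# in the same ballpark as the ~38.6 GB reported above -- already filling a 40 GB card.
# With ZeRO-2 across N GPUs the optimizer states (master weights + moments) are split N ways:
n = 4
print(f"optimizer states per GPU with {n}-way ZeRO-2: {(fp32_master + adam_moments) / n:5.1f} GiB")
```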
I've added a "colossalai_zero2_cpu" strategy for this script. I tested it on 4x 40G A100 and it works.
If you use the 'colossalai_zero2_cpu' strategy, how much slower is it than the previous 'colossalai_zero2'? @ver217
@binmakeswell Is the 'colossalai_zero2_cpu' strategy a complete solution for low-resource machines? Will there be other solutions?