stanford_alpaca
OOM issue
Can this fine-tuning script fit on an A10, which only has 24GB of GPU memory? I am trying to fine-tune the model on 4 A10 GPUs with a batch size of 1, but I still get an OOM error.
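For context on why 24GB per card is tight, here is a rough back-of-the-envelope estimate, assuming full fine-tuning of the 7B model with Adam in mixed precision (roughly 16 bytes of weight/gradient/optimizer state per parameter) ideally sharded across the 4 GPUs:

```python
# Rough estimate, assuming mixed-precision Adam: ~16 bytes of persistent state
# per parameter (half-precision weights + grads, fp32 master weights, fp32
# Adam momentum and variance), evenly sharded by FSDP full_shard.
params = 7e9                                 # LLaMA-7B
state_bytes = params * 16                    # ~112 GB of total training state
num_gpus = 4
per_gpu_gib = state_bytes / num_gpus / 2**30
print(f"~{per_gpu_gib:.0f} GiB per GPU before activations")  # ~26 GiB
```

So even with full sharding and batch size 1, the per-GPU state alone already exceeds an A10's 24GB before activations are counted, which is consistent with the OOM; gradient checkpointing and/or CPU offload would likely be needed, and it may still not fit.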
Just tried using 8 × A100 (40GB); still hitting an OOM error after one iteration:
{'loss': 1.6692, 'learning_rate': 1.360544217687075e-07, 'epoch': 0.0}
0%| | 1/4875 [00:06<7:54:55, 5.85s/it]Traceback (most recent call last):
File "stanford_alpaca/train.py", line 235, in
4x A100-80GB is OK, and GPU memory is almost fully allocated...
Can I train using only 1 × A100-80GB?
I've no idea. Maybe you can give it a try.
8 × A100 (40GB) worked after using fp16 instead of bf16, with batch_size = 1.
With the same configuration and fp16 I could do a batch size of 2, but with gradient accumulation of 1. Did you use gradient accumulation of 8 with fp16?
Yes I was using grad accumulation of 8 with batch size 1.
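For reference, here is roughly how the knobs being compared map onto the Hugging Face TrainingArguments that train.py parses (a sketch only; the values are just the ones mentioned in this thread, not a recommendation):

```python
from transformers import TrainingArguments

# Memory-relevant settings discussed above. Effective batch size is
# n_gpus * per_device_train_batch_size * gradient_accumulation_steps.
args = TrainingArguments(
    output_dir="./trained_model",       # hypothetical output path
    fp16=True,                          # fp16 fit on 8x A100-40GB in this thread
    bf16=False,                         # the bf16 run reportedly OOM'd on the same cards
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # 8 GPUs * 1 * 8 -> effective batch of 64
    gradient_checkpointing=True,        # extra memory saving; not used in the thread (assumption)
)
```

Note that gradient accumulation mostly trades time for effective batch size and barely changes peak memory, so bs=2/ga=1 vs bs=1/ga=8 mainly differ in effective batch (16 vs 64), not in per-step footprint.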
Nice! I have two questions.
- Did you ever get to resume training from a checkpoint? I could not; I was getting an error. If you did succeed, can you share the changes you made to the code (see the sketch after this list)? I only tried adding the checkpoint path to the train function, but I am not familiar with the fully sharded data parallel (FSDP) stuff.
- How long did your training take? Mine took about 8 hours (batch size of 2 with gradient accumulation of 1). I am assuming yours should take less time, maybe 4-6 hours?
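On the checkpoint question, a minimal sketch of the change I would try in train.py, assuming the stock Hugging Face Trainer call (untested here, and FSDP sharded checkpoints may add wrinkles):

```python
# Trainer.train() accepts resume_from_checkpoint: True resumes from the most
# recent checkpoint-* directory under output_dir, a string resumes from that
# specific path. Sketch only; not verified with --fsdp "full_shard auto_wrap".
trainer.train(resume_from_checkpoint=True)
# or with an explicit (hypothetical) checkpoint directory:
# trainer.train(resume_from_checkpoint="./trained_model/checkpoint-2000")
```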
No, I haven't tried to resume training from the checkpoint. Mine took roughly 6 hours, but I got an OOM error after the last epoch.
Could you please share your Python version and PyTorch version? I ran on 4 × A100 (40GB) with batch_size=1, fp16=True and gradient_accumulation_steps=1, but I still got OOM in the first epoch. I suspect that there's something wrong with my environment. Thanks a lot.
@yysjasmine Try this command. torchrun --nproc_per_node=8 --master_port=1234 train.py --model_name_or_path converted_llama_7B --data_path ./alpaca_data.json --fp16 True --output_dir ./trained_model --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
LLaMA-13B (HF) fails with OOM on dual A100-80GB.
torchrun --nproc_per_node=2 --master_port=9999 train.py --model_name_or_path ../llama-13b/ --data_path ./alpaca_data.json --fp16 True --output_dir ../alpaca-13b/ --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
(Also tested with --nproc_per_node=1; it also fails with OOM.)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 606.00 MiB (GPU 1; 79.18 GiB total capacity; 77.07 GiB already allocated; 468.31 MiB free; 77.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
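For what it's worth, the max_split_size_mb knob that error points at is set through the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be in place before CUDA is initialized; the 128 below is just an example value:

```python
# Set the allocator config before torch touches CUDA, e.g. at the very top of
# train.py (or export PYTORCH_CUDA_ALLOC_CONF in the shell before torchrun).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")  # example value

import torch  # imported after setting the variable so the caching allocator sees it
```

That only mitigates fragmentation, though; with ~77 GiB already allocated for a 13B full fine-tune on two cards, the allocation itself may simply be too large.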
@yysjasmine Try this command. torchrun --nproc_per_node=8 --master_port=1234 train.py --model_name_or_path converted_llama_7B --data_path ./alpaca_data.json --fp16 True --output_dir ./trained_model --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
Thank you, I will try the command.
@yysjasmine did the command work with 4 × A100?