
OOM issue

Open puyuanliu opened this issue 2 years ago • 8 comments

Can this finetuning script fit into A10, which only has 24GB GPU memory? I am trying to fine-tune the model on 4 A10 GPUs using a batch size of 1, but I still get an OOM error.

puyuanliu avatar Mar 16 '23 06:03 puyuanliu

Just tried using 8 A100 (40GB), still having OOM issue after one iteration:

{'loss': 1.6692, 'learning_rate': 1.360544217687075e-07, 'epoch': 0.0}
  0%|          | 1/4875 [00:06<7:54:55, 5.85s/it]
Traceback (most recent call last):
  File "stanford_alpaca/train.py", line 235, in <module>
    train()
  File "stanford_alpaca/train.py", line 228, in train
    trainer.train()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/transformers/trainer.py", line 1900, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/transformers/trainer.py", line 2662, in training_step
    loss.backward()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 388.00 MiB (GPU 2; 39.59 GiB total capacity; 36.74 GiB already allocated; 120.19 MiB free; 37.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

puyuanliu avatar Mar 16 '23 06:03 puyuanliu
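The traceback above ends with the allocator's hint to set max_split_size_mb to reduce fragmentation. Below is a minimal sketch of one way to pass that option through, assuming it is set before the first CUDA allocation; the 128 MiB value is an arbitrary example, not a recommendation from this thread (exporting PYTORCH_CUDA_ALLOC_CONF in the shell before launching torchrun would work as well):

import os

# Configure the CUDA caching allocator before torch touches the GPU.
# The 128 MiB split size is an arbitrary example value, not tuned here.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var so the allocator picks it up

print(torch.cuda.is_available())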

4x A100-80GB is OK, and GPU memory is almost fully allocated...

kriskrisliu avatar Mar 16 '23 07:03 kriskrisliu

4x A100-80GB is OK, and GPU memory is almost fully allocated...

Can I train using only 1 A100-80GB?

leondelee avatar Mar 16 '23 07:03 leondelee

4x A100-80GB is OK, and GPU memory is almost fully allocated...

Can I train using only 1 A100-80GB?

I've no idea. Maybe you can have a try.

kriskrisliu avatar Mar 16 '23 08:03 kriskrisliu

8 * A100 (40GB) worked after using fp16 instead of bf16, with batch_size = 1.

puyuanliu avatar Mar 16 '23 14:03 puyuanliu

With the same configuration and fp16 I could do a batch size of 2, but with gradient accumulation of 1. Did you use gradient accumulation of 8 with bf16?

KurtFeynmanGodel avatar Mar 16 '23 14:03 KurtFeynmanGodel

Yes, I was using grad accumulation of 8 with batch size 1.


puyuanliu avatar Mar 16 '23 14:03 puyuanliu
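For reference, a minimal sketch of how the settings described above (fp16 instead of bf16, per-device batch size 1, gradient accumulation 8) map onto Hugging Face TrainingArguments; output_dir is a placeholder and every unlisted argument keeps its default:

from transformers import TrainingArguments

# Precision and batch settings as discussed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./trained_model",
    fp16=True,                       # fp16 instead of bf16
    bf16=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)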

Yes, I was using grad accumulation of 8 with batch size 1.

Nice! I have two questions.

  1. Did you ever get to resume training from a checkpoint? I could not; I was getting some error. If you successfully did, can you share the changes you made to the code? I only tried adding the checkpoint path to the train function (see the sketch below), but I am not familiar with the fully sharded data parallel (FSDP) stuff.
  2. How long did your training take? Mine took about 8 hours (batch size of 2 with grad acc of 1). I am assuming yours should take less time, maybe 4-6 hours?

KurtFeynmanGodel avatar Mar 16 '23 15:03 KurtFeynmanGodel
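On the checkpoint question: Hugging Face's Trainer.train accepts a resume_from_checkpoint argument, so here is a hedged sketch of the kind of change being described. The helper name and checkpoint path are hypothetical, and how this interacts with the FSDP sharded state used here is untested:

from typing import Optional
from transformers import Trainer

def train_with_resume(trainer: Trainer, checkpoint_dir: Optional[str] = None) -> None:
    """Hypothetical wrapper around the trainer object already built in train.py."""
    # Passing a path resumes from that checkpoint; passing True tells the
    # Trainer to pick up the most recent checkpoint found in output_dir.
    trainer.train(resume_from_checkpoint=checkpoint_dir if checkpoint_dir else True)
    trainer.save_state()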

No, I haven't tried to resume training from a checkpoint. Mine took roughly 6 hours, but I got an OOM error after the last epoch.

puyuanliu avatar Mar 16 '23 18:03 puyuanliu

No, I haven't tried to resume training from a checkpoint. Mine took roughly 6 hours, but I got an OOM error after the last epoch.

Could you please share your Python version and PyTorch version? I ran on 4 * A100 (40GB) with batch_size=1, fp16=True and gradient_accumulation_steps=1, but I still got OOM in the first epoch. I suspect that there's something wrong with my environment. Thanks a lot.

yysjasmine avatar Mar 17 '23 14:03 yysjasmine

@yysjasmine Try this command:

torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

puyuanliu avatar Mar 17 '23 16:03 puyuanliu
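In case the environment is the culprit, here is a sketch of how the FSDP-related pieces of that command correspond to TrainingArguments fields; the values are copied from the command above and everything else is omitted for brevity:

from transformers import TrainingArguments

# FSDP-related flags from the torchrun command above, as TrainingArguments fields.
fsdp_args = TrainingArguments(
    output_dir="./trained_model",
    fp16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)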

LLaMA-13B (HF) fails with OOM on a dual A100-80GB:

torchrun --nproc_per_node=2 --master_port=9999 train.py \
    --model_name_or_path ../llama-13b/ \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ../alpaca-13b/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

(also tested with nproc=1, also fails with OOM)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 606.00 MiB (GPU 1; 79.18 GiB total capacity; 77.07 GiB already allocated; 468.31 MiB free; 77.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

jtang613 avatar Mar 18 '23 16:03 jtang613

@yysjasmine Try this command:

torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

Thank you, I will try the command.

yysjasmine avatar Mar 20 '23 03:03 yysjasmine

@yysjasmine did the command work with 4 A100s?

Ahtesham00 avatar Apr 07 '23 12:04 Ahtesham00