LLaVA
CUDA Out of Memory during Training on 80GB A100
Thanks for your great work. I am trying to run experiments following your command on a Slurm cluster with an 80GB A100, but I get CUDA out of memory during the backward pass of the very first iteration. Has anyone met the same problem, or have I misunderstood the instructions?
The script I used to start the experiment on one GPU is as follows:
srun -p A100 --gres=gpu:1 -n 1 --ntasks-per-node 1 --kill-on-bad-exit \
torchrun --nnodes=1 --nproc_per_node=1 --master_port=25001 \
llava/train/train_mem.py \
--model_name /path to /LLaVA_13B_v0/ \
--data_path ./LLaVA-Instruct-150K/conversation_58k.json \
--image_folder /path to /coco/train2017/ \
--vision_tower openai/clip-vit-large-patch14 \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--bf16 True \
--output_dir ./checkpoints \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
I have set per_device_train_batch_size to 1, but it still doesn't work. However, I can run the CLI inference without problems. The error messages are below.
0%| | 0/170043 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
File "/mnt/petrelfs/unified_benchmark/LLaMA-X/LLaVA/llava/train/train_mem.py", line 13, in <module>
train()
File "/mnt/petrelfs/unified_benchmark/LLaMA-X/LLaVA/llava/train/train.py", line 508, in train
trainer.train()
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2675, in training_step
loss.backward()
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 79.35 GiB total capacity; 77.09 GiB already allocated; 103.19 MiB free; 77.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
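The message itself points at PYTORCH_CUDA_ALLOC_CONF; for completeness, a minimal sketch of setting it (this only mitigates fragmentation and clearly cannot recover the ~77 GiB already allocated here):

import os
# Must run before torch initializes CUDA, e.g. at the very top of train_mem.py,
# or equivalently be exported as an environment variable in the launch script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"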
Thanks for your help in advance!
same OOM here
Hi @JulioZhao97, what do you mean by "here"? Did you succeed in the pretraining stage?
me too, especially when saving ckpt
No, I went directly to the second-stage finetuning and hit the same OOM on an 80GB A100. Waiting for a response from the authors.
Hi, thank you for your interest in our work.
Currently it is not possible to train the 13B model on a single 80GB A100. You may refer to https://github.com/lm-sys/FastChat/issues/367 for a calculation of the amount of GPU memory needed (a back-of-envelope version is sketched after this reply); we require a similar amount of VRAM as Vicuna training. You'll need multiple GPUs with FSDP to shard the model parameters and optimizer state across devices.
We have recently updated FSDP support for pretraining, which should help with the pretraining stage a bit.
We are also working on parameter-efficient tuning methods like LoRA. Stay tuned, and contributions are welcome!
Thanks.
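As a rough sketch of that calculation (back-of-envelope only, assuming standard mixed-precision Adam bookkeeping and ignoring activations; see the FastChat issue above for the detailed accounting):

# Approximate training-state memory for a 13B-parameter model with mixed-precision Adam.
params = 13e9                        # 13B parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4  # bf16 weights + bf16 grads + fp32 master weights + Adam m + Adam v
print(f"~{params * bytes_per_param / 1024**3:.0f} GiB")  # ~194 GiB of state alone, vs. 80 GiB on one A100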
Got it! Thanks for your help. I'll try to use more GPUs for training. To my understanding, the most recent code supports the FSDP training strategy, right?
Hi @wangjiongw, the most recent code supports FSDP for both pretraining and finetuning. There was an issue with FSDP support for pretraining, which was fixed today.
We have also successfully moved the LLaVA implementation into a single code base, so that we do not need to upgrade the transformers package for new features. Please upgrade to the latest code base and reinstall the transformers package, following the instructions below. Thanks.
git pull
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@cae78c46
pip install -e .
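For reference, here is a minimal sketch of roughly what the --fsdp "full_shard auto_wrap" and --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' flags in the training command correspond to in raw PyTorch FSDP (this is only an illustration of the wrapping the Trainer performs, not LLaVA's actual training code):

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def shard_for_training(model):
    # Wrap each LlamaDecoderLayer as its own FSDP unit and fully shard
    # parameters, gradients, and optimizer state across the participating ranks.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        auto_wrap_policy=wrap_policy,
    )

This is why adding GPUs helps: each rank only holds a shard of the parameters and optimizer state instead of a full copy.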
Got it! Thanks for your help and the excellent work; I will try following these instructions. The FSDP support you mentioned in your earlier reply has already helped a lot. I will probably close this issue now. Thanks again!