
CUDA Out of Memory during Training on 80GB A100

Open wangjiongw opened this issue 1 year ago • 6 comments

Thanks for your great work. I am trying to run experiments following your command on a Slurm cluster with an 80GB A100, but I get CUDA out of memory during the backward pass of the very first iteration. Has anyone met the same problem, or have I misunderstood the instructions?

The script I used to start the experiment on one GPU is as follows:

srun -p A100 --gres=gpu:1 -n 1 --ntasks-per-node 1 --kill-on-bad-exit \
  torchrun --nnodes=1 --nproc_per_node=1 --master_port=25001 \
  llava/train/train_mem.py \
    --model_name /path/to/LLaVA_13B_v0/ \
    --data_path ./LLaVA-Instruct-150K/conversation_58k.json \
    --image_folder /path/to/coco/train2017/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb 

I have set per_device_train_batch_size to 1, but it still runs out of memory. However, I can run the CLI inference without problems. The error messages are below.

  0%|          | 0/170043 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "/mnt/petrelfs/unified_benchmark/LLaMA-X/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/mnt/petrelfs/unified_benchmark/LLaMA-X/LLaVA/llava/train/train.py", line 508, in train
    trainer.train()
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1644, in train
    return inner_training_loop(
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2675, in training_step
    loss.backward()
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/mnt/cache/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 79.35 GiB total capacity; 77.09 GiB already allocated; 103.19 MiB free; 77.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
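
(For reference, the allocator hint in the message would be applied like the sketch below before launching; the 128 MiB split size is only an illustrative value, and since reserved memory here is barely above allocated memory, it is unlikely to fix the underlying shortfall.)

# Hedged sketch: apply the allocator hint from the error message before launching.
# max_split_size_mb:128 is an illustrative value; this only reduces fragmentation
# and cannot make up for a genuine shortage of GPU memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
srun -p A100 --gres=gpu:1 -n 1 --ntasks-per-node 1 --kill-on-bad-exit \
  torchrun --nnodes=1 --nproc_per_node=1 --master_port=25001 \
  llava/train/train_mem.py   # same training arguments as in the script above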

Thanks for your help in advance!

wangjiongw avatar Apr 28 '23 02:04 wangjiongw

same OOM here

JulioZhao97 avatar Apr 28 '23 06:04 JulioZhao97

same OOM here

Hi @JulioZhao97, what do you mean by "here"? Did you succeed in the pretraining stage?

wangjiongw avatar Apr 28 '23 07:04 wangjiongw

Me too, especially when saving checkpoints.

yuezewang avatar Apr 28 '23 07:04 yuezewang

same OOM here

Hi @JulioZhao97, what do you mean by "here"? Did you succeed in the pretraining stage?

No, I ran the second-stage finetuning directly, but hit the same OOM on an 80GB A100. Waiting for a response from the author.

JulioZhao97 avatar Apr 28 '23 07:04 JulioZhao97

Hi, thank you for your interest in our work.

Currently it is impossible to train the 13B model on a single 80GB A100. You may refer to https://github.com/lm-sys/FastChat/issues/367 for a calculation of the amount of GPU memory needed. We require a similar amount of VRAM as Vicuna training. You'll need multiple GPUs with FSDP to shard the model parameters and optimizer states across GPUs.
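
For illustration only (the 8-GPU count below is an assumption, not an official recipe), a multi-GPU launch can reuse the same arguments and change only the GPU / process count so FSDP can shard the model across devices:

# Rough sketch (assumed 8 GPUs on one node); all other training arguments are
# identical to the script at the top of this issue and are omitted here for brevity.
srun -p A100 --gres=gpu:8 -n 1 --ntasks-per-node 1 --kill-on-bad-exit \
  torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
  llava/train/train_mem.py \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --gradient_checkpointing True \
    --bf16 True

With full_shard, each GPU holds only a fraction of the parameters, gradients, and optimizer states, which is what brings the per-device memory below the 80GB limit.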

We have recently updated FSDP support for pretraining, which helps with pretraining memory usage a bit.

We are also working on parameter-efficient tuning methods like LoRA. Stay tuned, and contributions are welcome!

Thanks.

haotian-liu avatar Apr 29 '23 04:04 haotian-liu

Got it! Thanks for your help. I'll try to use more GPUs for training. To my understanding, the most recent code supports the FSDP training strategy, right?

wangjiongw avatar Apr 29 '23 16:04 wangjiongw

Hi @wangjiongw, the most recent code supports FSDP for both pretraining and finetuning. There was an issue with FSDP support for pretraining, which was fixed today.

We have also successfully moved the entire LLaVA implementation into a single code base, so that we do not need to upgrade the transformers package for new features. Please upgrade to the latest code base and reinstall the transformers package, following the instructions below. Thanks.

git pull
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@cae78c46
pip install -e .
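
(A quick, optional sanity check, assuming pip recorded the git origin of the install; the expected output shown in the comment is an approximation:)

# Confirm that the pinned transformers revision was picked up after reinstalling.
pip freeze | grep transformers
# expected to show something like: transformers @ git+https://github.com/huggingface/transformers@<commit>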

haotian-liu avatar May 01 '23 01:05 haotian-liu

Got it! Thanks for your help and excellent work; I will try following the instructions. The FSDP support mentioned in your earlier reply has really helped a lot. I will close this issue then; great appreciation again~

wangjiongw avatar May 01 '23 07:05 wangjiongw