[Bug] How to finetune InternVL2-40B with `deepspeed`
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
When fine-tuning InternVL2-40B with the DeepSpeed framework on 16×A100 GPUs, I get an OOM (out of memory) error.
Reproduction
set -x
PARTITION=${PARTITION:-"INTERN2"}
GPUS=16
GPUS_PER_NODE=8
QUOTA_TYPE=reserved
NODES=$((GPUS / GPUS_PER_NODE))
CPUS_PER_TASK=2
SRUN_ARGS=${SRUN_ARGS:-""}
BATCH_SIZE=64
PER_DEVICE_BATCH_SIZE=1
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export LAUNCHER=pytorch
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
OUTPUT_DIR='/xxxx/xxxxx/'
if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
NUM_PROCESSES=`echo ${CUDA_VISIBLE_DEVICES} | awk -F ',' '{print NF}'`
if [ ${NUM_PROCESSES} -eq 0 ]; then
  NUM_PROCESSES=`echo ${NVIDIA_VISIBLE_DEVICES} | awk -F ',' '{print NF}'`
fi
# number of gpus: 16
# batch size per gpu: 1
# gradient accumulation steps: 4
# total batch size: 64
# epoch: 1
deepspeed --num_gpus ${NUM_PROCESSES} --master_port=${MASTER_PORT} \
internvl/train/internvl_chat_finetune.py \
--model_name_or_path "./InternVL2-40B" \
--conv_style "Hermes-2" \
--output_dir ${OUTPUT_DIR} \
--meta_path "xxxxxxxx" \
--overwrite_output_dir True \
--force_image_size 448 \
--max_dynamic_patch 6 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.4 \
--freeze_llm False \
--freeze_mlp False \
--freeze_backbone True \
--vision_select_layer -1 \
--dataloader_num_workers 4 \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
--gradient_accumulation_steps ${GRADIENT_ACC} \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0.05 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 4096 \
--do_train True \
--grad_checkpoint True \
--group_by_length True \
--dynamic_image_size True \
--use_thumbnail True \
--ps_version 'v2' \
--deepspeed "zero_stage3_config_34b.json" \
--report_to "tensorboard" \
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
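As a point of comparison, below is a minimal sketch of a ZeRO-3 config with optimizer/parameter CPU offload that one could try in place of the config above. The file name zero_stage3_config_offload.json and the offload choices are illustrative, not the repository's zero_stage3_config_34b.json (whose contents are not reproduced here); the "auto" values are filled in by the HuggingFace Trainer integration.

# Sketch only: write a ZeRO-3 config that offloads optimizer (and parameter)
# state to CPU, then point --deepspeed at it instead of
# zero_stage3_config_34b.json. Offloading moves the Adam state that the
# traceback below fails to allocate on GPU into host RAM (DeepSpeed then
# typically switches from FusedAdam to DeepSpeedCPUAdam).
cat > zero_stage3_config_offload.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
EOF

CPU offload trades GPU memory for host RAM and step time, so it is a fallback rather than a drop-in replacement for the shipped config.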
Environment
Same as the original repository.
Error traceback
[2024-07-31 14:58:45,219] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 608
[2024-07-31 14:58:46,166] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 609
Traceback (most recent call last):
    main()
  File "internvl/train/internvl_chat_finetune.py", line 640, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2781, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1960, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 176, in backward
    self.engine.step()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
    self._optimizer_step(sub_group_id)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 947, in _optimizer_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 157, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.78 GiB (GPU 5; 79.35 GiB total capacity; 65.64 GiB already allocated; 3.71 GiB free; 73.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.78 GiB (GPU 0; 79.35 GiB total capacity; 65.64 GiB already allocated; 3.44 GiB free; 73.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
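The OOM message itself suggests tuning the caching allocator's max_split_size_mb to reduce fragmentation. A minimal, illustrative way to try that (the 128 MiB value is an arbitrary starting point, not a recommendation from the repository) is to export the setting before launching:

# Illustrative only: cap the allocator's split size, as suggested by the
# OOM message above, before running the deepspeed launch command.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

This only helps when fragmentation is the issue (reserved memory far above allocated memory); it does not reduce the optimizer-state footprint itself.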
"I also encountered the same bug use torch run, training Lora for Internvl2 on 8 A800s resulted in an OOM (Out Of Memory) issue."
Hello, has this issue been resolved? You could try fine-tuning the InternVL2-26B model first.
Due to the inactivity over the past two weeks, this issue might have already been resolved, so I will close it. If you have any further questions, please feel free to reopen it.
Hello, has this issue been resolved? You could try fine-tuning the InternVL2-26B model first.
Yes, the OOM issue has been resolved. It can be addressed either by allocating more memory when running internvl_chat_finetune or by using internvl_chat_pretrain.py for training.
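For anyone taking the internvl_chat_pretrain.py route, a minimal sketch of the change to the launch command follows; the path internvl/train/internvl_chat_pretrain.py is assumed to sit next to internvl_chat_finetune.py, and the remaining flags are assumed to carry over from the Reproduction section unchanged.

# Sketch only: same deepspeed launch as above, with the training entry point
# swapped from internvl_chat_finetune.py to internvl_chat_pretrain.py
# (path assumed). All other flags stay as in the Reproduction section.
deepspeed --num_gpus ${NUM_PROCESSES} --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_pretrain.py \
  --model_name_or_path "./InternVL2-40B" \
  --output_dir ${OUTPUT_DIR} \
  --deepspeed "zero_stage3_config_34b.json"
  # ...plus the same data, freezing, and optimization flags as before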