[Bug] How to finetune InternVL2-40B with `deepspeed`
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
When fine-tuning InternVL2-40B with the DeepSpeed framework on 16×A100 GPUs, I get an OOM (out of memory) error.
Reproduction
set -x
PARTITION=${PARTITION:-"INTERN2"}
GPUS=16
GPUS_PER_NODE=8
QUOTA_TYPE=reserved
NODES=$((GPUS / GPUS_PER_NODE))
CPUS_PER_TASK=2
SRUN_ARGS=${SRUN_ARGS:-""}
BATCH_SIZE=64
PER_DEVICE_BATCH_SIZE=1
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export LAUNCHER=pytorch
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
OUTPUT_DIR='/xxxx/xxxxx/'
if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
NUM_PROCESSES=`echo ${CUDA_VISIBLE_DEVICES} | awk -F ',' '{print NF}'`
if [ ${NUM_PROCESSES} -eq 0 ]; then
  NUM_PROCESSES=`echo ${NVIDIA_VISIBLE_DEVICES} | awk -F ',' '{print NF}'`
fi
# number of gpus: 16
# batch size per gpu: 1
# gradient accumulation steps: 4
# total batch size: 64
# epoch: 1
deepspeed --num_gpus ${NUM_PROCESSES} --master_port=${MASTER_PORT} \
internvl/train/internvl_chat_finetune.py \
--model_name_or_path "./InternVL2-40B" \
--conv_style "Hermes-2" \
--output_dir ${OUTPUT_DIR} \
--meta_path "xxxxxxxx" \
--overwrite_output_dir True \
--force_image_size 448 \
--max_dynamic_patch 6 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.4 \
--freeze_llm False \
--freeze_mlp False \
--freeze_backbone True \
--vision_select_layer -1 \
--dataloader_num_workers 4 \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
--gradient_accumulation_steps ${GRADIENT_ACC} \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0.05 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 4096 \
--do_train True \
--grad_checkpoint True \
--group_by_length True \
--dynamic_image_size True \
--use_thumbnail True \
--ps_version 'v2' \
--deepspeed "zero_stage3_config_34b.json" \
--report_to "tensorboard" \
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
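As a point of comparison, below is a minimal sketch of a ZeRO-3 config with optimizer/parameter CPU offload that one could try in place of the config above. The file name zero_stage3_config_offload.json and the offload choices are illustrative, not the repository's zero_stage3_config_34b.json (whose contents are not reproduced here); the "auto" values are filled in by the HuggingFace Trainer integration.

# Sketch only: write a ZeRO-3 config that offloads optimizer (and parameter)
# state to CPU, then point --deepspeed at it instead of
# zero_stage3_config_34b.json. Offloading moves the Adam state that the
# traceback below fails to allocate on GPU into host RAM (DeepSpeed then
# typically switches from FusedAdam to DeepSpeedCPUAdam).
cat > zero_stage3_config_offload.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
EOF

CPU offload trades GPU memory for host RAM and step time, so it is a fallback rather than a drop-in replacement for the shipped config.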
Environment
Same as the original repository.
Error traceback
[2024-07-31 14:58:45,219] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 608
[2024-07-31 14:58:46,166] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 609
Traceback (most recent call last):
    main()
  File "internvl/train/internvl_chat_finetune.py", line 640, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2781, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1960, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 176, in backward
    self.engine.step()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
    self._optimizer_step(sub_group_id)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 947, in _optimizer_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 157, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.78 GiB (GPU 5; 79.35 GiB total capacity; 65.64 GiB already allocated; 3.71 GiB free; 73.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.78 GiB (GPU 0; 79.35 GiB total capacity; 65.64 GiB already allocated; 3.44 GiB free; 73.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
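The OOM message itself suggests tuning the caching allocator's max_split_size_mb to reduce fragmentation. A minimal, illustrative way to try that (the 128 MiB value is an arbitrary starting point, not a recommendation from the repository) is to export the setting before launching:

# Illustrative only: cap the allocator's split size, as suggested by the
# OOM message above, before running the deepspeed launch command.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

This only helps when fragmentation is the issue (reserved memory far above allocated memory); it does not reduce the optimizer-state footprint itself.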
"I also encountered the same bug use torch run, training Lora for Internvl2 on 8 A800s resulted in an OOM (Out Of Memory) issue."
Hello, has this issue been resolved? You could try fine-tuning the InternVL2-26B model first.
Due to the inactivity over the past two weeks, this issue might have already been resolved, so I will close it. If you have any further questions, please feel free to reopen it.
Hello, has this issue been resolved? You could try fine-tuning the InternVL2-26B model first.
Yes, the OOM issue has been resolved. It can be addressed either by allocating more memory when running internvl_chat_finetune or by using internvl_chat_pretrain.py for training.
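For anyone taking the internvl_chat_pretrain.py route, a minimal sketch of the change to the launch command follows; the path internvl/train/internvl_chat_pretrain.py is assumed to sit next to internvl_chat_finetune.py, and the remaining flags are assumed to carry over from the Reproduction section unchanged.

# Sketch only: same deepspeed launch as above, with the training entry point
# swapped from internvl_chat_finetune.py to internvl_chat_pretrain.py
# (path assumed). All other flags stay as in the Reproduction section.
deepspeed --num_gpus ${NUM_PROCESSES} --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_pretrain.py \
  --model_name_or_path "./InternVL2-40B" \
  --output_dir ${OUTPUT_DIR} \
  --deepspeed "zero_stage3_config_34b.json"
  # ...plus the same data, freezing, and optimization flags as before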