CUDA out of memory

Open Ying-Kang opened this issue 1 year ago • 12 comments

When I tried to continue fine-tuning InternVL-Chat, an error occurred:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB

I set the batch size to 1 by modifying per_device_train_batch_size in the start script: internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune_continue.sh

but the problem still occurs.

I'd appreciate it if you could help. Thanks!

Ying-Kang avatar Apr 03 '24 02:04 Ying-Kang

By the way, I used 32 GPUs (V100 32 GB) with pytorch as the launcher.

Ying-Kang avatar Apr 03 '24 02:04 Ying-Kang

I put my start script here:

python internvl/train/internvl_chat_finetune.py \
--model_name_or_path "/xxx/InternVL/pretrained/InternVL-Chat-Chinese-V1-2-Plus" \
--conv_style "Hermes-2" \
--output_dir ${OUTPUT_DIR} \
--meta_path "internvl_chat/shell/data/zw_data.json" \
--overwrite_output_dir True \
--force_image_size 448 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.0 \
--pad2square False \
--freeze_llm False \
--freeze_mlp False \
--freeze_backbone True \
--vision_select_layer -1 \
--use_data_resampling False \
--dataloader_num_workers 2 \
--fp16 True \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0.05 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 2048 \
--group_by_length True \
--do_train True \
--grad_checkpoint True \
--deepspeed "zero_stage3_config.json" \
--report_to "tensorboard"

Ying-Kang avatar Apr 03 '24 02:04 Ying-Kang

I ran into the same problem. I have 8 V100s, with PER_DEVICE_BATCH_SIZE set to 1 and BATCH_SIZE set to 8.

emmating12 avatar Apr 03 '24 06:04 emmating12

Hi, thanks for your attention. Fine-tuning a 34B LLM on 32 GB GPUs can be challenging. You could try fine-tuning only the MLP connector, or using LoRA on the LLM.
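
A minimal sketch of the LoRA route, assuming the training script exposes a --use_llm_lora flag for the LoRA rank (please double-check the argument names in internvl/train/internvl_chat_finetune.py); the rest of the launch command stays the same:

# keep the ViT and the full LLM weights frozen; train only the MLP connector
# and the LoRA adapters injected into the LLM (flag name is an assumption)
--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \
--use_llm_lora 16 \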

czczup avatar Apr 07 '24 02:04 czczup

I ran into the same problem. I have 8 V100s, with PER_DEVICE_BATCH_SIZE set to 1 and BATCH_SIZE set to 8.

Hi, you can first try fine-tuning only the MLP layer and see whether it still OOMs:

--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \

czczup avatar Apr 07 '24 02:04 czczup

I ran into the same problem. I have 8 V100s, with PER_DEVICE_BATCH_SIZE set to 1 and BATCH_SIZE set to 8.

Hi, you can first try fine-tuning only the MLP layer and see whether it still OOMs:

--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \

Hi, I changed it as suggested this afternoon, but it still OOMs. During training only GPU 0's memory keeps growing until it blows up, while the memory of the other GPUs is never used. What could be the reason? CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 GPUS=8 BATCH_SIZE=8 PER_DEVICE_BATCH_SIZE=1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh

emmating12 avatar Apr 07 '24 09:04 emmating12

Oh, you may need to add a line at the top of the shell script: export LAUNCHER=pytorch. Please give it a try.
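
For reference, a hedged sketch of what the launch side could look like. It assumes torchrun is used to spawn one worker per GPU (the repo's own shell scripts may already handle this, so please check them first), and only the flags most relevant to memory are shown:

export LAUNCHER=pytorch
# one process per GPU; a bare `python ...` call starts a single process,
# so only GPU 0 fills up while the other cards stay idle
torchrun --nproc_per_node=8 --master_port=29500 \
internvl/train/internvl_chat_finetune.py \
--per_device_train_batch_size 1 \
--grad_checkpoint True \
--deepspeed "zero_stage3_config.json"
# ...plus the remaining training flags from the script above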

czczup avatar Apr 08 '24 03:04 czczup

Oh, you may need to add a line at the top of the shell script: export LAUNCHER=pytorch. Please give it a try.

I still encounter this problem even after changing the image size from 448 to 224:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 280.00 MiB (GPU 0; 31.74 GiB total capacity; 29.09 GiB already allocated; 111.12 MiB free; 30.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My changed launch script is as follows:

export LAUNCHER=pytorch
python internvl/train/internvl_chat_finetune.py \
--model_name_or_path "pretrained/InternVL-Chat-Chinese-V1-2-Plus" \
--conv_style "Hermes-2" \
--output_dir ${OUTPUT_DIR} \
--meta_path "internvl_chat/shell/data/zw_data.json" \
--overwrite_output_dir True \
--force_image_size 224 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.0 \
--pad2square False \
--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \
--vision_select_layer -1 \
--use_data_resampling False \
--dataloader_num_workers 2 \
--fp16 True \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0.05 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 2048 \
--group_by_length True \
--do_train True \
--grad_checkpoint True \
--deepspeed "zero_stage3_config.json" \
--report_to "tensorboard"
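
(As an aside, the fragmentation hint at the end of the error message above can be tried with the allocator setting below; it only helps when reserved memory is much larger than allocated memory and will not recover a genuine capacity shortfall.)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128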

Ying-Kang avatar Apr 08 '24 07:04 Ying-Kang

Sorry, adjustments to the DeepSpeed ZeRO-3 configuration may be needed (my default configuration is tuned for 80 GB A100 GPUs and keeps more of the model on each GPU to reduce communication). I'm trying now to see whether I can keep memory usage within 32 GB.
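
One common adjustment for 32 GB cards is to enable CPU offload of the optimizer state (and optionally the parameters) in the ZeRO-3 config. A minimal sketch of such a config is below, written to a hypothetical zero_stage3_config_offload.json; the values are illustrative (using the HF Trainer's "auto" placeholders), not the repo's defaults:

# write an illustrative ZeRO-3 config with CPU offload enabled
cat > zero_stage3_config_offload.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF

Then point --deepspeed at this file instead of zero_stage3_config.json. Offload trades GPU memory for host RAM and PCIe traffic, so training is slower, but it is often what makes a 34B model fit on 32 GB cards.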

czczup avatar Apr 08 '24 12:04 czczup

Sorry, adjustments to the DeepSpeed ZeRO-3 configuration may be needed (my default configuration is tuned for 80 GB A100 GPUs and keeps more of the model on each GPU to reduce communication). I'm trying now to see whether I can keep memory usage within 32 GB.

Looking forward to your further reply.

Ying-Kang avatar Apr 09 '24 01:04 Ying-Kang

Looking forward to fine-tuning on a 32 GB VRAM GPU too!

chenming-wu avatar Apr 28 '24 03:04 chenming-wu

I ran into the same problem. I have 8 V100s, with PER_DEVICE_BATCH_SIZE set to 1 and BATCH_SIZE set to 8.

Hi, you can first try fine-tuning only the MLP layer and see whether it still OOMs:

--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \

Hi, I changed it as suggested this afternoon, but it still OOMs. During training only GPU 0's memory keeps growing until it blows up, while the memory of the other GPUs is never used. What could be the reason? CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 GPUS=8 BATCH_SIZE=8 PER_DEVICE_BATCH_SIZE=1 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh

If only the first card's memory is growing, could it be that DeepSpeed isn't actually being used? In the sh script, is the python command prefixed with deepspeed?
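
For reference, a sketch of the deepspeed-launched variant (the deepspeed CLI spawns one worker per visible GPU so ZeRO-3 can shard the model across all 8 cards); the remaining training flags from the scripts above stay unchanged:

# replace `python` with the deepspeed launcher so every visible GPU gets a worker
deepspeed --num_gpus 8 internvl/train/internvl_chat_finetune.py \
--per_device_train_batch_size 1 \
--deepspeed "zero_stage3_config.json"
# ...plus the remaining training flags from the scripts above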

13633491388 avatar May 08 '24 11:05 13633491388

I also encountered the same problem. I have 8 V100s, PER_DEVICE_BATCH_SIZE is set to 1, BATCH_SIZE is set to 8.

Hi, you can try to fine-tune only the MLP layer to see if OOM occurs.

--freeze_llm True \
--freeze_mlp False \
--freeze_backbone True \

Why only the MLP? Can you please explain this?

shiva-vardhineedi avatar Jul 12 '24 08:07 shiva-vardhineedi