InternVL3.5 OOM
transformers==4.52.1
torchrun \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path ${MODEL_PATH} \
  --conv_style "internvl2_5" \
  --use_liger True \
  --use_fast_tokenizer False \
  --output_dir ${OUTPUT_DIR} \
  --meta_path ${META_PATH} \
  --overwrite_output_dir False \
  --force_image_size 112 \
  --max_dynamic_patch 1 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp False \
  --freeze_backbone True \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --save_strategy "steps" \
  --save_steps 50 \
  --save_total_limit 1 \
  --learning_rate 1e-5 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 32768 \
  --split_annotations True \
  --do_train True \
  --grad_checkpoint True \
  --gradient_checkpointing True \
  --group_by_length False \
  --dynamic_image_size False \
  --use_thumbnail True \
  --ps_version 'v2' \
  --use_custom_flash_attn False \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  --do_eval False \
  --eval_strategy "steps" \
  --eval_steps 300 \
  --label_names labels \
  --prediction_loss_only True \
  --bf16_full_eval True \
  --per_device_eval_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --num_train_epochs ${EPOCHS} \
  --use_packed_ds False \
  --num_images_expected 96 \
  --max_packed_tokens 32768 \
  --max_buffer_size 20 \
  --log_freq 200 \
  --strict_mode False \
  --replacement True \
  --allow_overflow False \
  --remove_unused_columns False \
  --seed 42 \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
To highlight the relevant settings: freeze_llm=True, freeze_mlp=False, freeze_backbone=True, i.e. the LLM and the vision backbone are frozen and only the MLP projector is meant to be trainable.
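For reference, a small snippet like the following could be dropped into internvl_chat_finetune.py right before the trainer is launched to confirm which top-level modules actually remain trainable. This is only a sketch: `model` is assumed to be the already-built (and partially frozen) InternVL chat model, and top-level names such as vision_model / language_model / mlp1 follow the public InternVL code and may differ in other builds.

```python
# Hypothetical debugging helper, placed right before trainer.train() in
# internvl_chat_finetune.py. `model` is the already-built, partially frozen
# InternVL chat model; module names here are illustrative.
from collections import defaultdict

def summarize_trainable(model):
    counts = defaultdict(lambda: [0, 0])            # module -> [trainable, frozen] param counts
    for name, p in model.named_parameters():
        top = name.split(".")[0]                    # e.g. vision_model, language_model, mlp1
        counts[top][0 if p.requires_grad else 1] += p.numel()
    for top, (trainable, frozen) in sorted(counts.items()):
        print(f"{top:20s} trainable={trainable:>13,} frozen={frozen:>13,}")

summarize_trainable(model)
```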
I am already reading the video frames at a resolution of 112 and it still OOMs, yet the same 2B model with InternVL3 does not have this problem. Could you advise what the cause might be?
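For context, a rough back-of-the-envelope estimate of the visual tokens per frame under these settings suggests the image tokens themselves should be tiny. This is a sketch only, assuming the standard InternVL ViT patch size of 14, that down_sample_ratio is applied to both spatial axes by pixel shuffle, and one tile per frame (max_dynamic_patch=1, dynamic_image_size=False):

```python
# Rough sanity check of visual tokens per frame under force_image_size=112,
# down_sample_ratio=0.5 (assumptions: ViT patch size 14, pixel shuffle scales
# both H and W, a single tile per frame).
def visual_tokens_per_frame(image_size=112, patch_size=14, down_sample_ratio=0.5):
    patches_per_side = image_size // patch_size            # 112 / 14 = 8
    tokens = (patches_per_side * down_sample_ratio) ** 2   # (8 * 0.5)^2 = 16
    return int(tokens)

num_frames = 32                                             # example frame count
print(visual_tokens_per_frame() * num_frames)               # 16 tokens/frame -> 512 total
```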
I am not sure whether the following log output is related:
[INFO|trainer.py:2409] 2025-09-05 03:48:04,390 >> ***** Running training *****
[INFO|trainer.py:2410] 2025-09-05 03:48:04,390 >> Num examples = 13,102
[INFO|trainer.py:2411] 2025-09-05 03:48:04,390 >> Num Epochs = 1
[INFO|trainer.py:2412] 2025-09-05 03:48:04,390 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2415] 2025-09-05 03:48:04,390 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2416] 2025-09-05 03:48:04,390 >> Gradient Accumulation steps = 16
[INFO|trainer.py:2417] 2025-09-05 03:48:04,390 >> Total optimization steps = 819
[INFO|trainer.py:2418] 2025-09-05 03:48:04,396 >> Number of trainable parameters = 69,730,304
2025-09-05 03:48:04,410 - INFO - vision_config is None. Initializing the InternVisionConfig with default values.
2025-09-05 03:48:04,410 - INFO - llm_config is None. Initializing the LlamaConfig config with default values (LlamaConfig).
2025-09-05 03:48:04,410 - INFO - vision_select_layer: -1
2025-09-05 03:48:04,410 - INFO - ps_version: v1
2025-09-05 03:48:04,410 - INFO - min_dynamic_patch: 1
2025-09-05 03:48:04,410 - INFO - max_dynamic_patch: 6
Can you share more detailed information, such as the exact point at which the OOM error occurs?
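For example, wrapping the training call roughly as below would record the exact stack trace and the allocator state at the moment of the OOM. This is only a sketch: the `trainer.train()` call name follows the fine-tuning script, and catching torch.cuda.OutOfMemoryError requires a reasonably recent PyTorch.

```python
# Hypothetical wrapper around the training call in internvl_chat_finetune.py
# to capture where the OOM happens and the allocator state at that moment.
import traceback
import torch

try:
    trainer.train()
except torch.cuda.OutOfMemoryError:
    print(traceback.format_exc())          # exact call site of the OOM
    print(torch.cuda.memory_summary())     # CUDA allocator state on this device
    raise
```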