
[Bug] InternVL MPO loss is 0 right from the start of training

Open · amoreZgx1n opened this issue 10 months ago

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
dynamic ViT batch size: 30, images per sample: 7.5, dynamic token length: 1024
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (7680) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([7680, 4096])
warning: The size of tensor a (3932) must match the size of tensor b (10240) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([3932, 4096]), vit_embeds.shape=torch.Size([10240, 4096])
[2025-02-08 10:58:15,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 78.92 | optimizer_gradients: 28.46 | optimizer_step: 48.12
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.09 | bwd_microstep: 1605.45 | bwd_inner_microstep: 1451.34 | bwd_allreduce_microstep: 154.04 | step_microstep: 188.06
[2025-02-08 10:58:15,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 496.08 | bwd: 1605.44 | bwd_inner: 1451.34 | bwd_allreduce: 154.05 | step: 188.06

1%| | 21/3306 [02:23<2:38:16, 2.89s/it]
02/08/2025 10:58:15 - WARNING - tensorboardX.x2num - NaN or Inf found in input tensor.

{'loss': 0.0, 'learning_rate': 1.0500000000000001e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': 0.0, 'logps/chosen': 0.0, 'logits/rejected': 6.465516090393066, 'logits/chosen': 6.095698833465576, 'nll_loss': nan, 'epoch': 0.02}
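The warning text is the standard PyTorch broadcasting error, raised at the point where ViT features are written into the IMG_CONTEXT placeholder positions: after truncation to max_seq_length only 3932 placeholder tokens remain, while the vision encoder still produces 7680 (or 10240) visual tokens, which plausibly also explains the zeroed rewards and the NaN nll_loss. A minimal sketch (not the actual InternVL source) that reproduces the exact message with the shapes from the log:

```python
# Minimal sketch, assuming the model injects visual features by assigning them to the
# IMG_CONTEXT placeholder positions; the shape values are copied from the warning above.
import torch

hidden = 4096        # LLM hidden size reported in the warning
placeholders = 3932  # IMG_CONTEXT positions left after truncation to max_seq_length
vit_tokens = 7680    # 30 tiles x 256 visual tokens, see "dynamic ViT batch size: 30"

input_embeds_selected = torch.zeros(placeholders, hidden)
vit_embeds = torch.zeros(vit_tokens, hidden)

try:
    # Mirrors the placeholder assignment; fails because dim 0 does not match.
    _ = input_embeds_selected * 0.0 + vit_embeds
except RuntimeError as e:
    print(f"warning: {e}")  # "The size of tensor a (3932) must match the size of tensor b (7680) ..."
```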

Reproduction

set -x

GPUS=${GPUS:-4}
GPUS_PER_NODE=${GPUS_PER_NODE:-1}
NODES=$((GPUS / GPUS_PER_NODE))
CPUS_PER_TASK=${CPUS_PER_TASK:-10}
SRUN_ARGS=${SRUN_ARGS:-""}
BATCH_SIZE=${BATCH_SIZE:-8}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-2}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))

cd /mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat
source /opt/conda/bin/activate

conda activate /mnt/pfs-mc0p4k/tts/team/zgx/environment/internvl2
echo "Python path: $(which python)" >> "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/shell/train_log.txt"
which python

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch

OUTPUT_DIR='/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/internvl_chat_mpo_v2/internvl2_8b_mpo_v1'

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

torchrun \
--nnodes=1 \
--node_rank=0 \
--master_addr=0.0.0.0 \
--nproc_per_node=${GPUS} \
--master_port=${MASTER_PORT} \
internvl/train/internvl_chat_dpo.py \
--model_name_or_path "/mnt/pfs-mc0p4k/tts/team/zgx/workplace/internVL2/finetuning/InternVL/internvl_chat/output/merged_model/internvl2_8b_v1" \
--conv_style "internlm2-chat" \
--output_dir ${OUTPUT_DIR} \
--meta_path "./shell/data/adqa_mpo.json" \
--overwrite_output_dir True \
--force_image_size 448 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.1 \
--pad2square False \
--freeze_llm False \
--freeze_mlp False \
--freeze_backbone False \
--vision_select_layer -1 \
--use_data_resampling False \
--dataloader_num_workers 8 \
--bf16 True \
--num_train_epochs 3 \
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
--gradient_accumulation_steps ${GRADIENT_ACC} \
--evaluation_strategy "no" \
--save_strategy "no" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 5e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 1024 \
--do_train True \
--grad_checkpoint True \
--group_by_length False \
--dynamic_image_size True \
--use_thumbnail True \
--ps_version 'v2' \
--deepspeed "zero_stage1_config.json" \
--report_to "tensorboard" \
--loss_type sigmoid,bco_pair \
--sigmoid_loss_weight 0.8 \
--bco_pair_loss_weight 0.2 \
--rpo_alpha 1 \
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
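A rough token-budget check against the flags above (a sketch under assumed values: ViT patch size 14, and the 30-tile step taken from the log) suggests that --max_seq_length 1024 cannot even hold the placeholders for a handful of 448 tiles:

```python
# Back-of-the-envelope budget for the config above. Assumed: ViT patch size 14;
# 30 tiles per step taken from "dynamic ViT batch size: 30" in the log.
force_image_size = 448
patch_size = 14
down_sample_ratio = 0.5
max_seq_length = 1024

tokens_per_tile = int((force_image_size / patch_size) ** 2 * down_sample_ratio ** 2)
print(tokens_per_tile)                    # 256 visual tokens per 448x448 tile
print(30 * tokens_per_tile)               # 7680, matches vit_embeds.shape[0] in the warnings
print(max_seq_length // tokens_per_tile)  # 4 -> at most 4 tiles fit, even with zero text tokens
```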

Environment

torch 2.0.1+cu118

Error traceback


amoreZgx1n · Feb 08 '25 03:02

I think it's because force_image_size and max_seq_length don't match: 448 should go with 8192. With 448 and 4096 I also see the loss become 0; after lowering them to 224 and 4096 together, training goes back to normal.

LKmidair · Apr 23 '25 06:04
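One possible reading of those numbers (an assumption on my part, not confirmed by LKmidair): the image-token cost per tile grows with the square of force_image_size, so the sequence budget has to grow with it; presumably 8192 is the max_seq_length that the official 448-resolution fine-tuning scripts pair with this image size.

```python
# Hedged interpretation of the 448 <-> 8192 and 224 <-> 4096 pairing above (my guess,
# not LKmidair's derivation). Assumes ViT patch size 14 and down_sample_ratio 0.5.
def tokens_per_tile(image_size, patch_size=14, down_sample_ratio=0.5):
    return int((image_size / patch_size) ** 2 * down_sample_ratio ** 2)

print(tokens_per_tile(448))       # 256 tokens per tile at force_image_size 448
print(tokens_per_tile(224))       # 64 tokens per tile at force_image_size 224
print(30 * tokens_per_tile(448))  # ~7680 image tokens for a 30-tile step: needs a budget near 8192
print(30 * tokens_per_tile(224))  # ~1920 image tokens: fits comfortably in 4096
```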

@LKmidair Could you please explain how you got the numbers?

SStoica12 · Sep 06 '25 02:09