RuntimeError during InternVL3.5-1B SFT: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3
The same data trains fine on InternVL3-1B, and the configs are essentially identical, but this run fails with a dimension mismatch right at the start of training. Could you help me figure out where the problem is?
The error is as follows:
rank18: outputs: BaseModelOutputWithPast = self.model(
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/utils/generic.py", line 1083, in wrapper
rank18: outputs = func(self, *args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 405, in forward
rank18: hidden_states = decoder_layer(
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
rank18: return self._gradient_checkpointing_func(partial(super().__call__, **kwargs), *args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
rank18: return disable_fn(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
rank18: return fn(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
rank18: return CheckpointFunction.apply(function, preserve, *args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
rank18: return super().apply(*args, **kwargs) # type: ignore[misc]
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 264, in forward
rank18: outputs = run_function(*args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 257, in forward
rank18: hidden_states, _ = self.self_attn(
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/LLM/VLM/InternVL3.5/internvl_chat_gpt_oss/internvl/patch/flash_sink_attn_monkey_patch.py", line 35, in _forward_gpt_oss
rank18: query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 222, in apply_rotary_pos_emb
rank18: q_embed = _apply_rotary_emb(q, cos, sin)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 214, in _apply_rotary_emb
rank18: first_ = first_half * cos - second_half * sin
rank18: RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3
When use_packed_ds is set to True, please make sure use_custom_flash_attn is also True.
When use_custom_flash_attn is set to True, the function that gets patched in is `_forward_gpt_oss_with_varlen`, but the 1B language model is not GPT-OSS. Does that function still apply in this case?
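For context, here is a minimal sketch of how the two rotary conventions collide (shapes are illustrative only, and it assumes the stock transformers implementations: the GPT-OSS helper splits q/k into halves and expects cos/sin of size head_dim // 2, while the Qwen3 rotary cache carries the full head_dim):

```python
import torch

# Illustrative shapes only: a Qwen3-style attention head with head_dim = 128.
bsz, n_heads, seq_len, head_dim = 1, 16, 8, 128
q = torch.randn(bsz, n_heads, seq_len, head_dim)

# Qwen3-style rotary cache: cos/sin span the FULL head_dim (rotate_half convention).
cos = torch.randn(bsz, seq_len, head_dim).unsqueeze(1)  # (bsz, 1, seq_len, 128)
sin = torch.randn(bsz, seq_len, head_dim).unsqueeze(1)

# GPT-OSS-style application: chunk q in half (last dim 128 -> 64) and multiply by
# cos/sin that are expected to be head_dim // 2 wide. Mixing the two conventions
# breaks broadcasting on the last dimension.
first_half, second_half = torch.chunk(q, 2, dim=-1)  # (..., 64)
try:
    first_ = first_half * cos - second_half * sin
except RuntimeError as e:
    # The size of tensor a (64) must match the size of tensor b (128)
    # at non-singleton dimension 3
    print(e)
```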
You're right, the monkey patch does need updating; I overlooked that earlier. I've pushed a new version of the code, please pull again and give it a try.
The new code hits the same problem. I'm using the InternVL3.5-4B model. https://github.com/OpenGVLab/InternVL/issues/1124#issuecomment-3243580319
When training the Qwen-based models, --use_custom_flash_attn needs to be set to False; it only needs to be True when training GPT-OSS. This was also fixed in last night's PR.
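If you are unsure which backbone a given checkpoint uses, a quick way to check is to read its config; this is only a sketch, assuming the InternVL chat config exposes the language model under `llm_config` and that the local checkpoint path from the script below is used:

```python
from transformers import AutoConfig

# Hypothetical check: inspect the language-model architecture of the checkpoint to
# decide the flag: GPT-OSS backbones -> --use_custom_flash_attn True,
# Qwen-based InternVL3.5 checkpoints (e.g. 1B/4B) -> --use_custom_flash_attn False.
cfg = AutoConfig.from_pretrained("/home/InternVL3_5-1B/", trust_remote_code=True)
print(cfg.llm_config.architectures)  # e.g. ['Qwen3ForCausalLM'] -> keep the flag False
```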
Training with the new code (pulled on the afternoon of Sept 2) fails with:
internvl_chat_gpt_oss/internvl/patch/qwen3_flash_monkey_patch.py", line 50, in _forward_qwen3
[rank1]: assert query_states.size(0) == key_states.size(0) == value_states.size(0) == 1
[rank1]: AssertionError
Script:
internvl/train/internvl_chat_finetune.py
--model_name_or_path "/home/InternVL3_5-1B/"
--conv_style "internvl2_5"
--use_fast_tokenizer False
--output_dir ${OUTPUT_DIR}
--meta_path "../data/mydata.json"
--overwrite_output_dir True
--force_image_size 448
--max_dynamic_patch 12
--down_sample_ratio 0.5
--drop_path_rate 0.0
--min_num_frame 8
--max_num_frame 32
--freeze_llm False
--freeze_mlp False
--freeze_backbone False
--vision_select_layer -1
--dataloader_num_workers 16
--bf16 True
--max_steps 8000
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE}
--gradient_accumulation_steps ${GRADIENT_ACC}
--save_strategy "steps"
--save_steps 100
--save_total_limit 2
--learning_rate 8e-5
--weight_decay 0.05
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--max_seq_length 32768
--split_annotations True
--do_train True
--grad_checkpoint True
--gradient_checkpointing True
--group_by_length False
--dynamic_image_size True
--use_thumbnail True
--ps_version 'v2'
--use_custom_flash_attn False
--report_to "tensorboard"
--deepspeed "zero_stage3_config.json"
--use_packed_ds True
--num_images_expected 96
--max_packed_tokens 32768
--max_buffer_size 20
--log_freq 1000
--strict_mode False
--replacement True
--allow_overflow False
--remove_unused_columns False
--loss_reduction "square"
--seed 42
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
When packed training is enabled, PER_DEVICE_BATCH_SIZE must be 1.
Is non-packing training with batch size > 1 supported now? Is there a reference script for it?
Just set use_packed_ds to False. That said, packed training is significantly more efficient than non-packed training; in large-scale runs the speed difference can be 2-3x.
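For anyone switching between the two modes: packed training flattens all samples into one long sequence and relies on per-sample boundaries for varlen attention, which is why the per-device batch size has to stay at 1. A rough sketch of the idea (assuming flash-attn style cu_seqlens; this is not the repo's actual collator):

```python
import torch

# Three variable-length samples (token ids) that would otherwise need padding.
samples = [torch.arange(5), torch.arange(9), torch.arange(3)]

# Packing: concatenate them into ONE row of shape (1, total_len) ...
packed = torch.cat(samples).unsqueeze(0)  # (1, 17)

# ... and record the boundaries so varlen attention never crosses samples.
lens = torch.tensor([0] + [len(s) for s in samples])
cu_seqlens = torch.cumsum(lens, dim=0).to(torch.int32)  # tensor([0, 5, 14, 17])

print(packed.shape, cu_seqlens)
# Because everything is fused into a single row, the model only ever sees
# batch_size == 1 (hence the assert in qwen3_flash_monkey_patch.py), while no
# compute is spent on padding tokens, which is where the 2-3x speedup comes from.
```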