RuntimeError during InternVL3.5-1B SFT: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3
The same data trains fine on InternVL3-1B, and the configs are essentially identical, but this run fails with a dimension mismatch right at the start of training. Could you help me figure out where the problem is?
The error is as follows:
rank18: outputs: BaseModelOutputWithPast = self.model(
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/utils/generic.py", line 1083, in wrapper
rank18: outputs = func(self, *args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 405, in forward
rank18: hidden_states = decoder_layer(
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
rank18: return self._gradient_checkpointing_func(partial(super().__call__, **kwargs), *args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
rank18: return disable_fn(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
rank18: return fn(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
rank18: return CheckpointFunction.apply(function, preserve, *args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
rank18: return super().apply(*args, **kwargs) # type: ignore[misc]
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 264, in forward
rank18: outputs = run_function(*args)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 257, in forward
rank18: hidden_states, _ = self.self_attn(
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank18: return self._call_impl(*args, **kwargs)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
rank18: return inner()
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
rank18: result = forward_call(*args, **kwargs)
rank18: File "/code/xxx/LLM/VLM/InternVL3.5/internvl_chat_gpt_oss/internvl/patch/flash_sink_attn_monkey_patch.py", line 35, in _forward_gpt_oss
rank18: query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 222, in apply_rotary_pos_emb
rank18: q_embed = _apply_rotary_emb(q, cos, sin)
rank18: File "/code/xxx/envs/internvl3/lib/python3.11/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 214, in _apply_rotary_emb
rank18: first_ = first_half * cos - second_half * sin
rank18: RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3
When use_packed_ds is set to True, please make sure use_custom_flash_attn is also True.
When use_custom_flash_attn is set to True, the function that gets patched in is `_forward_gpt_oss_with_varlen`, but the 1B language model is not GPT-OSS. Does that function still apply in this case?
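For context, here is a minimal sketch of how the two rotary conventions collide (shapes are illustrative only, and it assumes the stock transformers implementations: the GPT-OSS helper splits q/k into halves and expects cos/sin of size head_dim // 2, while the Qwen3 rotary cache carries the full head_dim):

```python
import torch

# Illustrative shapes only: a Qwen3-style attention head with head_dim = 128.
bsz, n_heads, seq_len, head_dim = 1, 16, 8, 128
q = torch.randn(bsz, n_heads, seq_len, head_dim)

# Qwen3-style rotary cache: cos/sin span the FULL head_dim (rotate_half convention).
cos = torch.randn(bsz, seq_len, head_dim).unsqueeze(1)  # (bsz, 1, seq_len, 128)
sin = torch.randn(bsz, seq_len, head_dim).unsqueeze(1)

# GPT-OSS-style application: chunk q in half (last dim 128 -> 64) and multiply by
# cos/sin that are expected to be head_dim // 2 wide. Mixing the two conventions
# breaks broadcasting on the last dimension.
first_half, second_half = torch.chunk(q, 2, dim=-1)  # (..., 64)
try:
    first_ = first_half * cos - second_half * sin
except RuntimeError as e:
    # The size of tensor a (64) must match the size of tensor b (128)
    # at non-singleton dimension 3
    print(e)
```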
You're right, the monkey patch does need updating; I overlooked that earlier. I've pushed a new version of the code, please pull again and give it a try.
The new code hits the same problem. I'm using the InternVL3.5-4B model. https://github.com/OpenGVLab/InternVL/issues/1124#issuecomment-3243580319
When training the Qwen-based models, --use_custom_flash_attn needs to be set to False; it only needs to be True when training GPT-OSS. This was also fixed in last night's PR.
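If you are unsure which backbone a given checkpoint uses, a quick way to check is to read its config; this is only a sketch, assuming the InternVL chat config exposes the language model under `llm_config` and that the local checkpoint path from the script below is used:

```python
from transformers import AutoConfig

# Hypothetical check: inspect the language-model architecture of the checkpoint to
# decide the flag: GPT-OSS backbones -> --use_custom_flash_attn True,
# Qwen-based InternVL3.5 checkpoints (e.g. 1B/4B) -> --use_custom_flash_attn False.
cfg = AutoConfig.from_pretrained("/home/InternVL3_5-1B/", trust_remote_code=True)
print(cfg.llm_config.architectures)  # e.g. ['Qwen3ForCausalLM'] -> keep the flag False
```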
Training with the new code (pulled on the afternoon of Sept 2) fails with:
internvl_chat_gpt_oss/internvl/patch/qwen3_flash_monkey_patch.py", line 50, in _forward_qwen3
[rank1]: assert query_states.size(0) == key_states.size(0) == value_states.size(0) == 1
[rank1]: AssertionError
Script:
internvl/train/internvl_chat_finetune.py
--model_name_or_path "/home/InternVL3_5-1B/"
--conv_style "internvl2_5"
--use_fast_tokenizer False
--output_dir ${OUTPUT_DIR}
--meta_path "../data/mydata.json"
--overwrite_output_dir True
--force_image_size 448
--max_dynamic_patch 12
--down_sample_ratio 0.5
--drop_path_rate 0.0
--min_num_frame 8
--max_num_frame 32
--freeze_llm False
--freeze_mlp False
--freeze_backbone False
--vision_select_layer -1
--dataloader_num_workers 16
--bf16 True
--max_steps 8000
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE}
--gradient_accumulation_steps ${GRADIENT_ACC}
--save_strategy "steps"
--save_steps 100
--save_total_limit 2
--learning_rate 8e-5
--weight_decay 0.05
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--max_seq_length 32768
--split_annotations True
--do_train True
--grad_checkpoint True
--gradient_checkpointing True
--group_by_length False
--dynamic_image_size True
--use_thumbnail True
--ps_version 'v2'
--use_custom_flash_attn False
--report_to "tensorboard"
--deepspeed "zero_stage3_config.json"
--use_packed_ds True
--num_images_expected 96
--max_packed_tokens 32768
--max_buffer_size 20
--log_freq 1000
--strict_mode False
--replacement True
--allow_overflow False
--remove_unused_columns False
--loss_reduction "square"
--seed 42
2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
When packed training is enabled, PER_DEVICE_BATCH_SIZE must be 1.
Is non-packing training with batch size > 1 supported now? Is there a reference script for it?
Just set use_packed_ds to False. That said, packed training is significantly more efficient than non-packed training; in large-scale runs the speed difference can be 2-3x.
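For anyone switching between the two modes: packed training flattens all samples into one long sequence and relies on per-sample boundaries for varlen attention, which is why the per-device batch size has to stay at 1. A rough sketch of the idea (assuming flash-attn style cu_seqlens; this is not the repo's actual collator):

```python
import torch

# Three variable-length samples (token ids) that would otherwise need padding.
samples = [torch.arange(5), torch.arange(9), torch.arange(3)]

# Packing: concatenate them into ONE row of shape (1, total_len) ...
packed = torch.cat(samples).unsqueeze(0)  # (1, 17)

# ... and record the boundaries so varlen attention never crosses samples.
lens = torch.tensor([0] + [len(s) for s in samples])
cu_seqlens = torch.cumsum(lens, dim=0).to(torch.int32)  # tensor([0, 5, 14, 17])

print(packed.shape, cu_seqlens)
# Because everything is fused into a single row, the model only ever sees
# batch_size == 1 (hence the assert in qwen3_flash_monkey_patch.py), while no
# compute is spent on padding tokens, which is where the 2-3x speedup comes from.
```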