
About InternVL3

Open · zyandtom opened this issue 8 months ago · 5 comments

Hi, does the SFT script for InternVL3 remain the same as for InternVL2.5? cc @czczup

zyandtom · Apr 14 '25 04:04

Also, I found that InternVL3 has integrated V2PE. Is there any change to the visual preprocessing?

zyandtom · Apr 14 '25 04:04

@zyandtom Hi! Could you find where V2PE is implemented in InternVL3? I couldn't locate it :(

JJJYmmm · Apr 14 '25 09:04

I didn't find it in the code either; I only saw it mentioned in the HF model card: https://huggingface.co/OpenGVLab/InternVL3-2B

zyandtom · Apr 14 '25 09:04

Yes, InternVL3 is compatible with the training and inference code of InternVL2.5; you can use the same code directly, without modification (see the loading sketch below). Currently, V2PE is not integrated into the released codebase, but that's not an issue: the standard position embedding is a special case of V2PE and performs well on short-context tasks, so nothing needs to change for classical multimodal tasks. We are working on supporting V2PE for long-context fine-tuning in future updates.

Lechatelia · Apr 16 '25 15:04
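
For reference, here is a minimal sketch of what "use the same code" can look like in practice. It follows the transformers remote-code API shown on the InternVL model cards; the dtype, device, and generation settings are illustrative assumptions, not part of this thread.

```python
# Minimal sketch: load InternVL3 with the same transformers-based code
# used for InternVL2.5. The model path is from this thread; dtype, device,
# and generation settings are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-2B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Pure-text turn; passing pixel_values=None follows the InternVL2.5 chat API.
question = "Hello, who are you?"
response = model.chat(tokenizer, None, question, dict(max_new_tokens=64))
print(response)
```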

thx!

zyandtom · Apr 16 '25 15:04

I understand that the paper shows little performance difference between V2PE and the original RoPE during training. However, since InternVL3 is trained with V2PE, would replacing it with RoPE during inference truly have no impact on performance? After all, this introduces an inconsistency between the training and inference stages.

shuzhangcasia · Jul 14 '25 11:07

@shuzhangcasia There is no inconsistency between training and inference. You can consider standard RoPE a special case of V2PE where the delta equals 1.

  1. Since V2PE is trained to adapt to a range of delta values, including delta = 1, it naturally covers the behavior of standard RoPE.
  2. Using standard RoPE during inference is therefore fully compatible with V2PE-based training. For classical multimodal tasks with short context lengths, delta = 1 usually gives the best performance (see the sketch after this comment).

Lechatelia · Jul 14 '25 11:07
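
For readers following along, here is a minimal sketch of the delta = 1 equivalence described above. This is not the released InternVL code; the function name and the integration point are assumptions, and it only illustrates the position-id bookkeeping that V2PE changes.

```python
# Sketch of V2PE-style position-id assignment, based on the description
# in this thread: visual tokens advance the position index by delta
# instead of 1, and delta == 1 recovers standard RoPE position ids.
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Assign position ids to a mixed text/visual token sequence.

    is_visual: bool tensor of shape (seq_len,); True marks visual tokens.
    delta: positional increment for visual tokens. delta == 1 yields the
    standard RoPE ids 0, 1, 2, ...; delta < 1 compresses the positional
    span occupied by long visual contexts.
    """
    # Each text token advances the position by 1, each visual token by delta.
    text_steps = torch.ones(is_visual.shape, dtype=torch.float32)
    visual_steps = torch.full(is_visual.shape, delta, dtype=torch.float32)
    steps = torch.where(is_visual, visual_steps, text_steps)
    # Cumulative sum, shifted so the first token sits at position 0.
    return torch.cumsum(steps, dim=0) - steps

# delta = 1 gives 0, 1, 2, ..., i.e. exactly the standard RoPE position
# ids, which is why inference with the default code path stays consistent
# with V2PE-based training.
mask = torch.tensor([False, True, True, True, False])
print(v2pe_position_ids(mask, delta=0.25))  # tensor([0.0000, 1.0000, 1.2500, 1.5000, 1.7500])
print(v2pe_position_ids(mask, delta=1.0))   # tensor([0., 1., 2., 3., 4.])
```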