
About InternVL3

Open · zyandtom opened this issue 8 months ago · 5 comments

Hi, does the SFT script for InternVL3 remain the same as for InternVL2.5? cc @czczup

zyandtom · Apr 14 '25 04:04

Also, I found that InternVL3 has integrated V2PE. Is there any change to the visual preprocessing?

zyandtom · Apr 14 '25 04:04

@zyandtom Hi! Could you find where V2PE is implemented in InternVL3? I couldn't locate it :(

JJJYmmm · Apr 14 '25 09:04

I didn't find it in the code either; I only saw it mentioned in the HF model card: https://huggingface.co/OpenGVLab/InternVL3-2B

zyandtom · Apr 14 '25 09:04

Yes, InternVL3 is compatible with the training and inference code of InternVL2.5; you can use the same code directly, without modification (see the loading sketch below). Currently, V2PE is not integrated into the released codebase, but that's not an issue: the standard position embedding is a special case of V2PE and performs well on short-context tasks, so nothing needs to change for classical multimodal tasks. We are working on supporting V2PE for long-context fine-tuning in future updates.

Lechatelia · Apr 16 '25 15:04
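
For reference, here is a minimal sketch of what "use the same code" can look like in practice. It follows the transformers remote-code API shown on the InternVL model cards; the dtype, device, and generation settings are illustrative assumptions, not part of this thread.

```python
# Minimal sketch: load InternVL3 with the same transformers-based code
# used for InternVL2.5. The model path is from this thread; dtype, device,
# and generation settings are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-2B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Pure-text turn; passing pixel_values=None follows the InternVL2.5 chat API.
question = "Hello, who are you?"
response = model.chat(tokenizer, None, question, dict(max_new_tokens=64))
print(response)
```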

thx!

zyandtom · Apr 16 '25 15:04

I understand that the paper shows little performance difference between V2PE and the original RoPE during training. However, since InternVL3 is trained with V2PE, would replacing it with RoPE during inference truly have no impact on performance? After all, this introduces an inconsistency between the training and inference stages.

shuzhangcasia · Jul 14 '25 11:07

@shuzhangcasia There is no inconsistency between training and inference. You can consider standard RoPE a special case of V2PE where the delta equals 1.

  1. Since V2PE is trained to adapt to a range of delta values, including delta = 1, it naturally covers the behavior of standard RoPE.
  2. Using standard RoPE during inference is therefore fully compatible with V2PE-based training. For classical multimodal tasks with short context lengths, delta = 1 usually gives the best performance (see the sketch after this comment).

Lechatelia · Jul 14 '25 11:07
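
For readers following along, here is a minimal sketch of the delta = 1 equivalence described above. This is not the released InternVL code; the function name and the integration point are assumptions, and it only illustrates the position-id bookkeeping that V2PE changes.

```python
# Sketch of V2PE-style position-id assignment, based on the description
# in this thread: visual tokens advance the position index by delta
# instead of 1, and delta == 1 recovers standard RoPE position ids.
import torch

def v2pe_position_ids(is_visual: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Assign position ids to a mixed text/visual token sequence.

    is_visual: bool tensor of shape (seq_len,); True marks visual tokens.
    delta: positional increment for visual tokens. delta == 1 yields the
    standard RoPE ids 0, 1, 2, ...; delta < 1 compresses the positional
    span occupied by long visual contexts.
    """
    # Each text token advances the position by 1, each visual token by delta.
    text_steps = torch.ones(is_visual.shape, dtype=torch.float32)
    visual_steps = torch.full(is_visual.shape, delta, dtype=torch.float32)
    steps = torch.where(is_visual, visual_steps, text_steps)
    # Cumulative sum, shifted so the first token sits at position 0.
    return torch.cumsum(steps, dim=0) - steps

# delta = 1 gives 0, 1, 2, ..., i.e. exactly the standard RoPE position
# ids, which is why inference with the default code path stays consistent
# with V2PE-based training.
mask = torch.tensor([False, True, True, True, False])
print(v2pe_position_ids(mask, delta=0.25))  # tensor([0.0000, 1.0000, 1.2500, 1.5000, 1.7500])
print(v2pe_position_ids(mask, delta=1.0))   # tensor([0., 1., 2., 3., 4.])
```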