
Question on 2D pixel shuffle in InternVL-2.5

Open franciszchen opened this issue 9 months ago • 0 comments

Thanks for sharing this great project. I have a question about the 2D pixel shuffle. vit_embeds has shape [N, L, C]. It is first reshaped to [N, h, w, C], and then pixel shuffle is applied over the two spatial dimensions (via reshape, permute, and contiguous) to produce [N, h/2, w/2, C*4]. Finally, vit_embeds is reshaped into [N, L/4, C*4] for further use.

Why not perform the pixel shuffle directly on vit_embeds, going from [N, L, C] to [N, L/4, C*4]? If we modify the inference code to use this direct pixel shuffle, will the change have a significant influence on performance?

https://github.com/OpenGVLab/InternVL/blob/34a81000402bf8f716bab8c9b57aff1f6b436bd0/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py#L287
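To make the difference between the two reshapes concrete, here is a minimal sketch in NumPy. The function `pixel_shuffle_2d` is a paraphrase of the reshape/permute sequence described above, not a copy of the linked code, and the tiny sizes (h = w = 4, C = 1) are assumptions chosen so that each value equals its token index. It suggests that both paths yield the same output shape but group different tokens: the 2D version merges each 2x2 spatial neighborhood, while the direct reshape merges four consecutive tokens of the same row.

```python
import numpy as np

def pixel_shuffle_2d(x, scale_factor=0.5):
    """Merge each 2x2 spatial patch of an [N, W, H, C] grid into the channel axis.

    Illustrative paraphrase of the steps described in the question; NumPy's
    reshape/transpose stand in for torch's view/permute/contiguous.
    """
    n, w, h, c = x.shape
    # [N, W, H, C] -> [N, W, H/2, 2C]: fold pairs along the last spatial axis
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    # [N, W, H/2, 2C] -> [N, H/2, W, 2C]: swap the two spatial axes
    x = x.transpose(0, 2, 1, 3)
    # [N, H/2, W, 2C] -> [N, H/2, W/2, 4C]: fold pairs along the other spatial axis
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor),
                  int(c / (scale_factor * scale_factor)))
    # swap the spatial axes back so the token order stays row-major
    x = x.transpose(0, 2, 1, 3)
    return x

# Assumed toy sizes: one image, a 4x4 grid of tokens, one channel per token.
n, h, w, c = 1, 4, 4, 1
tokens = np.arange(n * h * w * c).reshape(n, h * w, c)  # [N, L, C], value == token index

# Path A: reshape to a grid, shuffle in 2D, then flatten to [N, L/4, C*4].
grid = tokens.reshape(n, h, w, c)
shuffled = pixel_shuffle_2d(grid).reshape(n, -1, c * 4)

# Path B: the direct reshape proposed in the question.
direct = tokens.reshape(n, (h * w) // 4, c * 4)

print("2D pixel shuffle groups:\n", shuffled[0])
print("direct reshape groups:\n", direct[0])
```

In this toy case the first merged token from the 2D path contains original tokens 0, 1, 4, 5 (a 2x2 block), whereas the direct reshape gives 0, 1, 2, 3 (a 1x4 strip). Since the layers that consume vit_embeds were presumably trained on the 2D grouping, swapping in the direct reshape at inference would feed them differently arranged features; how much that affects accuracy is not something this sketch can show.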

franciszchen, Mar 21 '25 18:03