InternVideo2.5 Temporal Modeling

Open arushirai1 opened this issue 9 months ago • 0 comments

Thank you for this video model! I have one question: is all the temporal modeling in InternVideo2.5 offloaded to the LLM? That is how it appears from the demo provided here (https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B/), where frame-level representations are passed as input to the LLM. From the demo:

video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])

Could you share the motivation for not doing spatio-temporal modeling explicitly in the vision encoder layers?
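To make the question concrete, here is a minimal sketch of what that demo line produces. The values in num_patches_list, the helper name build_video_prefix, and the example question text are my own assumptions for illustration; only the prefix expression itself is taken from the demo.

```python
# Assumed for illustration: 4 sampled frames, one image tile per frame.
# In the demo, num_patches_list comes from the video loading step.
num_patches_list = [1, 1, 1, 1]


def build_video_prefix(num_patches_list):
    """Hypothetical helper wrapping the prefix expression from the demo.

    Emits one "FrameN: <image>" line per sampled frame, so the only
    temporal signal in the prompt is the frame index and the token order.
    """
    return "".join(f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list)))


video_prefix = build_video_prefix(num_patches_list)

# Example question text is an assumption, not taken from the demo.
prompt = video_prefix + "Describe this video in detail."
print(prompt)
```

Each frame gets its own <image> placeholder in the prompt, so as far as I can tell from this line alone, frame order is conveyed to the LLM through the text prefix rather than through the vision encoder.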

Mar 13 '25 14:03 arushirai1