InternVideo
InternVideo2.5 Temporal Modeling
Thank you for this video model!
I have one question: is all of the temporal modeling in InternVideo2.5 offloaded to the LLM? That is how it appears from the demo provided at https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B/. Specifically, frame-level representations are passed as input to the LLM, as in this line from the demo:
video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))])
Could you share the motivation for not doing spatio-temporal modeling explicitly in the vision encoder layers?
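To make the question concrete, here is a minimal sketch of what the demo line quoted above produces. The `num_patches_list` value is a hypothetical one I made up for illustration (three sampled frames, one patch each); only the `video_prefix` expression is taken from the demo.

```python
# Hypothetical input: three sampled frames, one patch per frame.
# In the demo this list comes from the video loading step.
num_patches_list = [1, 1, 1]

# Expression copied from the demo: one "<image>" placeholder per frame,
# each preceded by a textual frame label.
video_prefix = "".join(
    [f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))]
)

print(video_prefix)
```

As far as I can tell from this, each frame contributes its own `<image>` placeholder, and the frame order seems to be conveyed only by the `FrameN` labels and the position of the tokens in the prompt, which is why I am asking whether the vision encoder does any temporal modeling of its own.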