Questions About InternVideo2clip Training Data and Fine-Tuning Requirements

Open JayChen7777 opened this issue 6 months ago • 0 comments

Thank you for your work! I have a few questions:

In the paper, it is stated: "We also learn a CLIP-style InternVideo2 indicated by InternVideo2clip. It is post-pretrained from InternVideo2s2 by only preserving video and text encoders and contrastive loss."

May I ask what training dataset was used for InternVideo2clip? Does it include any Chinese data? Approximately how much data would be required to fine-tune it effectively?

I also noticed that, for the vision encoder, only the attnpool weights have been released in the official checkpoint: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4

Jun 12 '25 03:06 JayChen7777