InternVL
Questions about your model on Video-MME
I noticed your recent strong result on the Video-MME leaderboard (https://video-mme.github.io/home_page.html#leaderboard): the model ranks 9th, with 20B parameters and 10 input frames. You also announced this result on the GitHub homepage. I am curious about two things:
- How did you test your model? The model you have open-sourced seems to be a single-frame model. How did you extend it to 10 frames?
- Which model is the 20B model? Has it been released as open source?
Thank you!
Hello, this result was tested by the author of Video-MME, so I am not clear about some of the details. However, I was recently able to reproduce it, possibly with a slightly higher score, when I tested the InternVL-Chat-V1-5 model on the Video-MME dataset integrated in VLMEvalKit. You can take a look at VLMEvalKit.
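On the question of feeding a video to an image-based chat model: one common approach is to sample a fixed number of frames evenly across the video and pass them to the model as multiple images. This is an assumption for illustration only; the thread does not confirm that the Video-MME author or VLMEvalKit does exactly this. A minimal sketch of the frame-index selection, where the helper name `uniform_frame_indices` and its parameters are hypothetical:

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int = 10) -> List[int]:
    """Pick `num_frames` indices spread evenly over a video.

    Hypothetical helper, not taken from InternVL or VLMEvalKit.
    The video is split into `num_frames` equal segments and the
    middle frame of each segment is selected.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")

    # Short video: there are not enough frames to sample, so use all of them.
    if num_frames >= total_frames:
        return list(range(total_frames))

    segment = total_frames / num_frames  # segment length in frames (float)
    # Midpoint of segment i is segment * i + segment / 2.
    return [int(segment * i + segment / 2) for i in range(num_frames)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 10))
```

For a 100-frame video and 10 frames, this selects indices 5, 15, ..., 95. The frames at those indices would then be decoded with whatever video reader you use and passed to the model as a list of images.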