Questions about your model in video-mme

Open zmj1203 opened this issue 1 year ago • 1 comments

I noticed your recent strong result on Video-MME (https://video-mme.github.io/home_page.html#leaderboard): the model ranks 9th, with a parameter size of 20B and 10 input frames. You also announced this result on the GitHub homepage. I am curious:

How did you test your model? The model you have open-sourced seems to be a single-frame model, so how did you extend it to 10 frames?
Which model is your 20B model? Has it been released as open source? Thank you!

Jul 03 '24 11:07 zmj1203

Hello, this result was tested by the author of Video-MME, so I am not clear about some of the details. However, I was recently able to reproduce this result, possibly even slightly higher, when I tested the InternVL-Chat-V1-5 model on the Video-MME dataset integrated in VLMEvalKit. You can take a look at VLMEvalKit.
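
On the question of going from a single image to 10 frames: the exact sampling used for the leaderboard entry is not confirmed in this thread. A common approach is to pick frames uniformly across the video and pass them to the model as a list of images. The sketch below only illustrates that idea. The function name `uniform_frame_indices` is hypothetical, and this is not taken from VLMEvalKit or the Video-MME evaluation code.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 10) -> list[int]:
    """Pick `num_frames` frame indices spread evenly over a video.

    Assumption: the video is split into `num_frames` equal segments and the
    middle frame of each segment is taken. The actual evaluation may differ.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Midpoint of each segment, clamped to the last valid frame index.
    # If the video has fewer frames than requested, indices will repeat.
    return [
        min(total_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Example: a 100-frame clip sampled down to 10 frames.
    print(uniform_frame_indices(100, 10))
```

The selected frames would then be decoded and given to the model together with the question, in whatever multi-image format the model supports.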

Jul 31 '24 08:07 czczup