
There is a large gap between the validation accuracy measured by VLMEvalKit and the accuracy reported in the model papers

Open YongLD opened this issue 1 year ago • 6 comments

On the TextVQA dataset, the InstructBLIP-13B paper reports an accuracy of 50.7, and the Qwen-VL-Chat paper reports 63.75. In the official VLMEvalKit results, however, InstructBLIP-13B scores about 30 and Qwen-VL-Chat scores 10.5. What do you think the problem is? Also, when I tested InstructBLIP-13B on TextVQA myself, I got an accuracy of 16.7. What went wrong? These are all prefetch results; GPT is not used for answer matching.
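For context, TextVQA is scored with the standard VQA accuracy metric, which compares a prediction against a set of human answers. The snippet below is a simplified sketch of that metric (the official scorer also normalizes punctuation, articles, and number words before comparing), so small differences in answer matching between toolkits can compound with prompt differences:

```python
def vqa_accuracy(prediction: str, gt_answers: list[str]) -> float:
    """Simplified VQA accuracy: a prediction matching at least 3 of the
    human answers scores 1.0, otherwise matches / 3.
    Sketch only; the official metric also normalizes punctuation,
    articles, and number words before comparing."""
    pred = prediction.strip().lower()
    matches = sum(pred == a.strip().lower() for a in gt_answers)
    return min(matches / 3.0, 1.0)
```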

YongLD avatar Feb 23 '24 11:02 YongLD

Hi @YongLD, actually, support for VQA datasets is still in progress (we only share some preliminary results for now). We still cannot reproduce the accuracies reported in the VLM papers. Potential reasons include different prompts or different inference hyperparameters.

kennymckormick avatar Feb 24 '24 12:02 kennymckormick

Hello, I find that for the TextVQA dataset, the LLaVA evaluation includes reference OCR tokens in the prompt, e.g.: `What kind of beer is this?\nReference OCR token: NINK, NK, BOWING, CC, STON, SUE, ED, Sublimely, SELF, ELF-RICHEE, swAaVd, KGy, ALE\nAnswer the question using a single word or phrase.` VLMEvalKit does not apply those tokens. Would you consider adding an option for users to choose whether to include the reference tokens?
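To illustrate the difference, a hypothetical option for including the reference OCR tokens could be sketched as follows (the function and parameter names here are made up for illustration and are not part of VLMEvalKit's API):

```python
def build_textvqa_prompt(question: str, ocr_tokens=None) -> str:
    """Build a TextVQA prompt in the LLaVA style, optionally appending
    reference OCR tokens (hypothetical helper, for illustration only)."""
    prompt = question
    if ocr_tokens:
        prompt += "\nReference OCR token: " + ", ".join(ocr_tokens)
    prompt += "\nAnswer the question using a single word or phrase."
    return prompt

# With tokens (LLaVA-style prompt):
with_tokens = build_textvqa_prompt("What kind of beer is this?", ["NINK", "NK", "ALE"])
# Without tokens:
without_tokens = build_textvqa_prompt("What kind of beer is this?")
```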

John-Ge avatar Feb 25 '24 07:02 John-Ge

@kennymckormick Can we use the azure openai key in VlmEvalKit? How can I change the base_url of azure?

YongLD avatar Feb 28 '24 07:02 YongLD

Currently, VLMEvalKit does not support Azure OpenAI API keys (I do not have Azure API access, so I cannot debug it). You can follow the Azure documentation to add an Azure OpenAI wrapper to VLMEvalKit.
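For reference, a minimal sketch of such a wrapper using the `openai` Python package (>= 1.0), which ships an `AzureOpenAI` client. The environment variable names, API version, and deployment name below are placeholders you would take from your own Azure OpenAI resource:

```python
import os

from openai import AzureOpenAI  # pip install "openai>=1.0"

# Placeholders: take these values from your Azure OpenAI resource.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
)

def azure_chat(prompt: str, deployment: str = "my-gpt-4-deployment") -> str:
    # On Azure, `model` is the *deployment* name, not the base model name.
    resp = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

This is a configuration sketch, not a tested integration; it requires valid Azure credentials to run.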

kennymckormick avatar Feb 28 '24 13:02 kennymckormick

I've noticed that most of the results differ from those in the research papers. I believe the framework should be fixed so that its results correspond more closely to the original ones.

geknow avatar Mar 13 '24 02:03 geknow

Hi @geknow, we understand that the known issues with the preliminary VQA results might be misleading. We have now removed these results and will re-upload them once they are ready.

kennymckormick avatar Mar 13 '24 02:03 kennymckormick