VLMEvalKit
There is a large gap between the validation accuracy reported by VLMEvalKit and the model papers
On the TextVQA dataset, the InstructBLIP-13B paper reports an accuracy of 50.7, and the Qwen-VL-Chat paper reports 63.75. However, the official VLMEvalKit results show roughly 30 for InstructBLIP-13B and 10.5 for Qwen-VL-Chat. What do you think is the problem?
Also, I tested InstructBLIP-13B on TextVQA myself and got an accuracy of 16.7. What went wrong? All of these are prefetch results; GPT is not used.
Hi @YongLD, actually, support for the VQA datasets is still in progress (we only share some preliminary results for now). We still cannot reproduce the accuracies reported in the VLM papers. Potential reasons might be a different prompt or different inference hyperparameters.
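As an illustration of the hyperparameter point, below is a minimal sketch of running InstructBLIP with explicit decoding settings via the HuggingFace `transformers` checkpoint. The prompt string and the beam-search values (`num_beams=5`, `length_penalty=-1.0`) are assumptions chosen to mimic common short-answer VQA decoding, not VLMEvalKit's or the paper's exact configuration; even small differences in these settings can shift prefetch-style exact-match scores noticeably.

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Load the HuggingFace InstructBLIP checkpoint (assumed here; the paper's
# numbers come from the authors' own pipeline, which may differ).
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-13b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-13b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("sample.jpg").convert("RGB")
prompt = "What kind of beer is this? Answer the question using a single word or phrase."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

# Decoding hyperparameters strongly affect short-answer VQA scores; the values
# below are an assumption meant to illustrate the point, not a ground truth.
out = model.generate(**inputs, num_beams=5, max_new_tokens=10, min_length=1, length_penalty=-1.0)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```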
Hello, I find that for the TextVQA dataset, LLaVA is evaluated with reference OCR tokens, using a prompt like: `What kind of beer is this?\nReference OCR token: NINK, NK, BOWING, CC, STON, SUE, ED, Sublimely, SELF, ELF-RICHEE, swAaVd, KGy, ALE\nAnswer the question using a single word or phrase.` VLMEvalKit does not apply these tokens. Would you like to add an option for users to choose whether or not to add the reference tokens? A rough sketch is below.
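For reference, here is a minimal sketch of what such an option could look like, assuming a standalone prompt-builder function; the names `build_textvqa_prompt`, `ocr_tokens`, and `use_ocr_reference` are hypothetical and not part of VLMEvalKit's actual API.

```python
from typing import List, Optional

def build_textvqa_prompt(
    question: str,
    ocr_tokens: Optional[List[str]] = None,
    use_ocr_reference: bool = False,
) -> str:
    """Build a TextVQA prompt, optionally appending LLaVA-style reference OCR tokens.

    `ocr_tokens` and `use_ocr_reference` are hypothetical names used for
    illustration; VLMEvalKit's real prompt-building code may be structured differently.
    """
    prompt = question
    if use_ocr_reference and ocr_tokens:
        # Mirror the LLaVA evaluation format: OCR tokens on their own line.
        prompt += "\nReference OCR token: " + ", ".join(ocr_tokens)
    prompt += "\nAnswer the question using a single word or phrase."
    return prompt

# Example: reproduces the LLaVA-style prompt shown above when the flag is on.
print(build_textvqa_prompt("What kind of beer is this?", ["NINK", "NK", "ALE"], use_ocr_reference=True))
```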
@kennymckormick Can we use an Azure OpenAI key in VLMEvalKit? How can I change the base_url for Azure?
Currently, VLMEvalKit does not support Azure OpenAI API keys (because I do not have Azure API access and cannot debug it). You can follow the Azure docs to contribute an Azure OpenAI wrapper to VLMEvalKit.
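For anyone who wants to attempt this, a minimal sketch with the official `openai` Python SDK (v1+) might look like the following; the endpoint, API version, and deployment name are placeholders you would fill in from your own Azure resource, and this is not an existing VLMEvalKit integration.

```python
from openai import AzureOpenAI  # requires openai>=1.0

# Azure replaces base_url with azure_endpoint plus an api_version;
# all three values below are placeholders for your own Azure resource.
client = AzureOpenAI(
    api_key="<AZURE_OPENAI_API_KEY>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

# Azure routes requests by deployment name rather than the raw model name.
resp = client.chat.completions.create(
    model="<your-deployment-name>",
    messages=[{"role": "user", "content": "Extract the final answer from: ..."}],
)
print(resp.choices[0].message.content)
```

A wrapper for VLMEvalKit would mostly mean wiring this client into the judge/answer-extraction path in place of the default OpenAI one.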
I've noticed that most of the results differ from those in the research papers. I believe the framework should be fixed so that it corresponds more closely to the original results.
Hi @geknow, we understand that the known issues with the preliminary VQA results might be misleading. We have now removed these results and will re-upload them once they are ready.