BLINK_Benchmark
Suggestion to Include GPT-4o Results in Evaluation
Very helpful research, great work! I wanted to express my appreciation for your team's excellent work, which has contributed significantly to the evaluation of visual language models. Your paper has been a valuable resource for the community.
I was wondering if you might be interested in adding GPT-4o results to your evaluations. In my own testing, GPT-4o showed a substantial improvement in visual performance over GPT-4V. When I ran the error cases presented in your paper through GPT-4o, it answered the majority of them correctly.
I also noticed an interesting phenomenon during my testing: when the model gave an incorrect answer under an "answer only" prompt, asking it to explain its reasoning often led it to correct the error. This suggests that Chain-of-Thought (CoT) prompting may also help improve the performance of visual language models.
I hope you find this feedback useful, and I would welcome the opportunity to discuss further ways to improve the evaluation of these important models. Thank you again for your excellent work.