InternVL
I observed a difference between my GPT-5 evaluation results on the ERQA dataset and the ones reported.
Thank you for your work! I have a question about the GPT-5 evaluation on the ERQA dataset in the latest InternVL3.5 paper. The reported GPT-5 score is 65.7, which seems quite high; in my own evaluation I obtained 55.44. Could you please share the evaluation details and API settings you used? Did you modify the prompt? For reference, I evaluated GPT-5 using the original ERQA codebase.
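For context, my per-question call looks roughly like the sketch below. This is only an illustration of my own setup, not the paper's protocol: the model identifier, the prompt wording, and the use of the OpenAI Chat Completions API are my assumptions, and the actual prompt/settings in the original ERQA codebase may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_gpt5(question: str, options: list[str], image_paths: list[str]) -> str:
    """Send one ERQA-style multiple-choice item to GPT-5 and return the raw reply."""
    # Prompt wording is my own; the original ERQA codebase may phrase this differently.
    prompt = (
        question
        + "\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with the letter of the correct option."
    )
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    response = client.chat.completions.create(
        model="gpt-5",  # model identifier used in my runs; sampling parameters left at API defaults
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Example usage with a hypothetical item and local image file:
# print(ask_gpt5("Which object is closest to the robot gripper?",
#                ["red cube", "blue bowl", "green mug", "none of the above"],
#                ["example.png"]))
```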
Thank you for your interest in our work. We reported this score based on the information provided in their official blog.