
The inference on the AMBER dataset is very slow.

[Open] pspdada opened this issue 8 months ago • 6 comments

The AMBER dataset is a yes-or-no dataset, but its questions do not explicitly state "Please answer yes or no." As a result, during evaluation the model generates very long responses, which slows down the entire inference process. Is it possible to address this issue?

pspdada avatar Apr 26 '25 16:04 pspdada

Hi, thanks for pointing out the problem. I checked the original AMBER dataset; it does not include a suffix such as "Please answer yes or no." For consistency, we did not add this suffix either.

If you want to generate shorter responses, you can modify the dataset file manually or consider using a custom prompt for the evaluated model.
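
For illustration, a custom prompt could look roughly like the sketch below. The hook names (`use_custom_prompt`, `build_prompt`, `dump_image`) follow VLMEvalKit's `BaseModel` convention, but the exact import path and signatures should be checked against your model class:

```python
from vlmeval.vlm.base import BaseModel  # import path may differ across versions


class MyVLM(BaseModel):  # hypothetical model wrapper

    def use_custom_prompt(self, dataset):
        # Opt in to custom prompt building for AMBER only.
        return dataset == 'AMBER'

    def build_prompt(self, line, dataset=None):
        assert self.use_custom_prompt(dataset)
        tgt_path = self.dump_image(line, dataset)  # dump image(s), get local path(s)
        question = line['question']
        if dataset == 'AMBER':
            # Append the missing instruction so the model answers briefly.
            question += ' Please answer yes or no.'
        msgs = [dict(type='image', value=p) for p in tgt_path]
        msgs.append(dict(type='text', value=question))
        return msgs
```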

MaoSong2022 avatar Apr 27 '25 08:04 MaoSong2022

That somehow makes sense; I can add this additional instruction to the test prompt of AMBER.

kennymckormick avatar Apr 27 '25 12:04 kennymckormick

Maybe it can be handled another way? Such as setting max new tokens to 1 when evaluating a model on the AMBER benchmark?
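
Concretely, I mean something like the following (a sketch assuming a local HuggingFace-style model; `model`, `tokenizer`, and `inputs` are placeholders, not VLMEvalKit internals):

```python
# Cap generation at a single new token so the reply is just "Yes"/"No".
output_ids = model.generate(
    **inputs,
    max_new_tokens=1,  # hard cap: one new token only
    do_sample=False,   # greedy decoding, deterministic
)
# Decode only the newly generated token, skipping the prompt.
answer = tokenizer.decode(
    output_ids[0, inputs['input_ids'].shape[1]:],
    skip_special_tokens=True,
)
```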

pspdada avatar Apr 27 '25 12:04 pspdada

Hi, @pspdada . The problem has been resolved in https://github.com/open-compass/VLMEvalKit/pull/961.

kennymckormick avatar Apr 27 '25 12:04 kennymckormick

I do not recommend this alternative (capping max new tokens at 1), since most API VLMs may not say Yes / No at the beginning of their responses.
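
In that case, evaluation has to find the verdict anywhere in a free-form reply, along the lines of the hypothetical matcher below (not VLMEvalKit's actual implementation):

```python
import re


def extract_yes_no(response: str):
    """Find a yes/no verdict anywhere in a free-form reply, not only at
    the start. Returns None when the reply is ambiguous or missing one."""
    text = response.strip().lower()
    has_yes = re.search(r'\byes\b', text) is not None
    has_no = re.search(r'\bno\b', text) is not None
    if has_yes and not has_no:
        return 'yes'
    if has_no and not has_yes:
        return 'no'
    return None  # ambiguous; needs a fallback judge
```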

kennymckormick avatar Apr 27 '25 12:04 kennymckormick

@kennymckormick I can help with this.

MaoSong2022 avatar Apr 27 '25 12:04 MaoSong2022