VLMEvalKit

[Help Wanted] the alignment with official accuracy in llama3.2-vision

Open · droidXrobot opened this issue 1 year ago • 8 comments

droidXrobot · Sep 29 '24

Does the repo support this model yet? Thanks!

shan23chen · Sep 29 '24

Hi @droidXrobot @shan23chen! This repo now supports Llama-3.2-11B/90B-Vision-Instruct; you can use it with the newest transformers version (>=4.45.0.dev0). However, the evaluation results obtained with the current repo do not match the official results, and even after the hyperparameters and the system prompt are aligned there is still a noticeable accuracy drop (mainly on AI2D). Is anyone willing to look into this problem?

Ref: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/generation_config.json https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md
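
For reference, a minimal sketch (plain transformers, not the VLMEvalKit wrapper) of running the model with the sampling hyperparameters loaded from the repo's generation_config.json; the image path and prompt are placeholders, and the benchmark-specific system prompt from eval_details.md is not shown:

```python
# Minimal sketch, assuming access to the gated meta-llama repo on the HF Hub.
# This is not the VLMEvalKit integration, only a plain transformers example.
import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Take the sampling hyperparameters from the official generation_config.json
# instead of hard-coding them, so they stay aligned with the reference setup.
gen_cfg = GenerationConfig.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Which option is correct? Answer with the letter."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, generation_config=gen_cfg, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```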

FangXinyu-0913 · Oct 03 '24

Actually, none of my benchmark results can be reproduced with the same model as before...

This repo updates too quickly... many changes might have caused the misalignment.

luohao123 · Oct 09 '24

> Actually, none of my benchmark results can be reproduced with the same model as before... This repo updates too quickly... many changes might have caused the misalignment.

Would you please provide more information, such as the commit IDs of the previous and current code you used for evaluation, as well as the model and benchmarks you evaluated?
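
If it helps, here is a small hypothetical helper (not part of VLMEvalKit) for recording the commit of the local checkout alongside each evaluation run, so results can later be tied to an exact code state:

```python
# Hypothetical helper, not part of VLMEvalKit: record the commit of the local
# checkout so an evaluation run can later be matched to an exact code state.
import subprocess

def current_commit(repo_dir="."):
    # Full hash of HEAD in the given repository.
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()

print(current_commit())
```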

kennymckormick · Oct 09 '24

As a user, we cannot compare each commit to see what changed; that is your responsibility.

The current situation is that all benchmark scores have dropped so much that the evaluation can be considered wrong. The changes I can observe are:

  1. The tsv files are newly generated;
  2. There is an operation that did not exist before; I don't know what it is: [image], and it is slow;
  3. The metrics are now all lower, on all benchmarks, with the same model;
  4. I don't know what changed inside the evalkit.

I even doubted my training codebase was wrong, which kept me stuck for about a week.

Afterwards, I realised the evaluation pipeline was broken: the old model cannot reproduce the metrics from before.

Any suggestions?

luohao123 · Oct 09 '24

> As a user, we cannot compare each commit to see what changed; that is your responsibility. The current situation is that all benchmark scores have dropped so much that the evaluation can be considered wrong. [...]

At least, you need to provide some information so that we can help. Please tell me the model you are using and one or several benchmarks you are evaluating. If you cannot find the initial commit you were using, please try to remember when you first used this codebase.
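
If the exact commit is unknown, a hypothetical sketch (again, not part of VLMEvalKit) of recovering it from an approximate date and summarizing what has changed since; the date below is only an example:

```python
# Hypothetical sketch: find the commit that was current around a given date,
# then summarize what has changed between that commit and the current HEAD.
import subprocess

def run_git(*args, repo_dir="."):
    return subprocess.check_output(["git", *args], cwd=repo_dir, text=True).strip()

# Last commit made before the approximate date you first used the codebase.
old_commit = run_git("log", "-1", "--before=2024-08-01", "--format=%H")
print("likely old commit:", old_commit)

# High-level summary of the files changed between that commit and HEAD.
print(run_git("diff", "--stat", f"{old_commit}..HEAD"))
```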

kennymckormick · Oct 09 '24

> As a user, we cannot compare each commit to see what changed; that is your responsibility. The current situation is that all benchmark scores have dropped so much that the evaluation can be considered wrong. [...]

Same Here: https://github.com/open-compass/VLMEvalKit/issues/503#issuecomment-2404134873

Also, if you want to go further with this problem, creating a new issue might be a better idea. Your problem is not related to this llama-3.2 issue.

kennymckormick · Oct 10 '24

> As a user, we cannot compare each commit to see what changed; that is your responsibility. The current situation is that all benchmark scores have dropped so much that the evaluation can be considered wrong. [...]
>
> Same Here: #503 (comment)
>
> Also, if you want to go further with this problem, creating a new issue might be a better idea. Your problem is not related to this llama-3.2 issue.

@kennymckormick Same issue here too: #523. Could you please check that one? Thanks!

terry-for-github · Oct 16 '24