Gaokao and some other datasets show many zero scores when I evaluate them [Bug]

kkwhale7 opened this issue 2 years ago • 15 comments

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

latest environment

Reproduces the problem - code/configuration sample

1

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python run.py --hf-path /Llama-2-7b-hf --datasets gsm8k_gen_1d7fe4 bbh_gen math_gen_265cce GaokaoBench_gen_5cfe9e agieval_gen_a0c741 --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 100 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1 --max-partition-size 15000

Reproduces the problem - error message

[screenshot] The results show too many zeros. Also, in my evaluation the Gaokao result for llama-2-7b-hf is 7.06, which is not consistent with the 18.9 reported at https://opencompass.org.cn/leaderboard-llm. Thank you, much appreciated.

Other information

I want to reproduce the 18.9 shown on your website!

kkwhale7 avatar Oct 14 '23 06:10 kkwhale7

@tonysy @lvhan028 @so2liu @cdpath I need your help!

kkwhale7 avatar Oct 14 '23 06:10 kkwhale7

@Leymore

kkwhale7 avatar Oct 15 '23 02:10 kkwhale7

You haven't implemented the evaluation logic for subjective questions, so why are the values displayed on the official website different from ours? [screenshot]

kkwhale7 avatar Oct 16 '23 07:10 kkwhale7

We only include the objective questions of Gaokao in OpenCompass

tonysy avatar Oct 16 '23 09:10 tonysy

We only include the objective questions of Gaokao in OpenCompass

But the score on your website is 18.9 for GAOKAO [screenshot]; we can't reproduce it!

kkwhale7 avatar Oct 16 '23 09:10 kkwhale7

In my run, I only calculated the objective score, which is 15.13. [screenshot]

kkwhale7 avatar Oct 16 '23 09:10 kkwhale7

The MCQ problems require selecting one answer from 4 choices, so random guessing already yields about 25%; a result below 25% is meaningless.
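For illustration, a minimal simulation (not part of OpenCompass) of the random-guess baseline on 4-choice questions; scores well below ~25% usually mean no valid option letter was extracted at all:

```python
import random

# Random guessing on 4-choice MCQs has an expected accuracy of 1/4 = 25%,
# so accuracies far below that suggest the answer extraction is failing.
choices = ["A", "B", "C", "D"]
trials = 100_000
hits = sum(random.choice(choices) == random.choice(choices) for _ in range(trials))
print(f"random-guess accuracy: {hits / trials:.3f}")  # ~0.250
```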

kirliavc avatar Oct 17 '23 00:10 kirliavc

The MCQ problems require selecting one answer from 4 choices, so random guessing already yields about 25%; a result below 25% is meaningless.

I got it. So do you directly ignore the scores of the multiple-choice sections, or only count the parts greater than 25?

kkwhale7 avatar Oct 17 '23 02:10 kkwhale7

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow with ZeroRetriever. All results below are with llama2-7b models. [screenshots]

kkwhale7 avatar Oct 17 '23 03:10 kkwhale7

We will review this problem; more information and logs will be provided later.

tonysy avatar Oct 17 '23 03:10 tonysy

Detailed scores can be found here: https://opencompass.org.cn/dataset-detail/GAOKAO-Bench

The average score is weighted by the total score of each individual subject. We do NOT ignore scores below 25.0!
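For illustration, a minimal sketch (hypothetical numbers, not the actual OpenCompass code) of a weighted average where each subject's weight is its total score:

```python
# Hypothetical subject results: (points earned by the model, total points available).
subjects = {
    "subject_A_MCQs": (10.0, 40.0),
    "subject_B_MCQs": (5.0, 30.0),
    "subject_C_MCQs": (20.0, 50.0),
}

# Weighting each subject's accuracy by its total score reduces to
# sum(earned) / sum(total); no subject is dropped, even if it is below 25%.
earned = sum(score for score, _ in subjects.values())
total = sum(full for _, full in subjects.values())
print(f"weighted average: {100 * earned / total:.2f}")  # 35 / 120 -> 29.17
```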

As for llama-2-7b's failure to follow the instructions, we think this is totally understandable. We implement the postprocessing here: https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/GaokaoBench.py. The final result depends on the output of this postprocessing.
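For illustration, a simplified sketch of that kind of postprocessing (scan the generation from the end and keep the last standalone option letter); the real implementation is in GaokaoBench.py and may differ in its patterns and per-question-type handling:

```python
import re

def extract_last_option(prediction: str) -> str:
    """Return the last standalone A/B/C/D in the prediction, or '' if none is found.

    Simplified sketch only; the actual OpenCompass postprocessing in
    GaokaoBench.py may use different regexes and question-type logic.
    """
    matches = re.findall(r"\b([ABCD])\b", prediction)
    return matches[-1] if matches else ""

print(extract_last_option("... so the answer is C."))        # -> "C"
print(extract_last_option("I am not sure about this one."))  # -> ""
```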

Leymore avatar Oct 17 '23 10:10 Leymore

Thank you for your patience. I understand your score calculation method now. But why are the two predictions different when I use the same config? This postprocessing method only takes the first A/B/C/D character found when scanning the prediction from back to front, yet the prediction is still inconsistent with the other test: https://github.com/open-compass/opencompass/issues/480#issuecomment-1765588411

kkwhale7 avatar Oct 17 '23 11:10 kkwhale7

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow with ZeroRetriever. All results below are with llama2-7b models. [screenshots] So why are the two prediction results inconsistent when using the gen approach?

kkwhale7 avatar Oct 20 '23 07:10 kkwhale7

@kkwhale7 Hey, does that still exist?

tonysy avatar Oct 30 '23 14:10 tonysy

@kkwhale7 Hey, does that still exist?

Yes. When I only calculate the average score over the subjects you specified, Baichuan2-7b-base scores only 17.33 on GAOKAO, while it is reported as 34.8 on your official website. [screenshots] I can't reproduce it with your latest version.

kkwhale7 avatar Nov 03 '23 02:11 kkwhale7