Gaokao and some other datasets show many zero scores when I evaluate them [Bug]

kkwhale7 opened this issue 2 years ago • 15 comments

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

latest environment

Reproduces the problem - code/configuration sample

1

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python run.py --hf-path /Llama-2-7b-hf --datasets gsm8k_gen_1d7fe4 bbh_gen math_gen_265cce GaokaoBench_gen_5cfe9e agieval_gen_a0c741 --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 100 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1 --max-partition-size 15000

Reproduces the problem - error message

[screenshot] The results show too many zeros. Also, in my evaluation the Gaokao result for llama-2-7b-hf is 7.06, which is not consistent with the 18.9 reported at https://opencompass.org.cn/leaderboard-llm. Thank you, much appreciated.

Other information

I want to reproduce the 18.9 shown on your website!

kkwhale7 avatar Oct 14 '23 06:10 kkwhale7

@tonysy @lvhan028 @so2liu @cdpath I need your help!

kkwhale7 avatar Oct 14 '23 06:10 kkwhale7

@Leymore

kkwhale7 avatar Oct 15 '23 02:10 kkwhale7

You haven't implemented the evaluation logic for subjective questions, so why are the values displayed on the official website different from ours? [screenshot]

kkwhale7 avatar Oct 16 '23 07:10 kkwhale7

We only include the objective questions of Gaokao in OpenCompass

tonysy avatar Oct 16 '23 09:10 tonysy

We only include the objective questions of Gaokao in OpenCompass

But the score on your website is 18.9 for GAOKAO [screenshot]; we can't reproduce it!

kkwhale7 avatar Oct 16 '23 09:10 kkwhale7

In my run, I only calculated the objective score, which is 15.13. [screenshot]

kkwhale7 avatar Oct 16 '23 09:10 kkwhale7

The MCQ problems require selecting one answer from 4 choices, so random guessing already yields about 25%; a result below 25% is meaningless.
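For illustration, a minimal simulation (not part of OpenCompass) of the random-guess baseline on 4-choice questions; scores well below ~25% usually mean no valid option letter was extracted at all:

```python
import random

# Random guessing on 4-choice MCQs has an expected accuracy of 1/4 = 25%,
# so accuracies far below that suggest the answer extraction is failing.
choices = ["A", "B", "C", "D"]
trials = 100_000
hits = sum(random.choice(choices) == random.choice(choices) for _ in range(trials))
print(f"random-guess accuracy: {hits / trials:.3f}")  # ~0.250
```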

kirliavc avatar Oct 17 '23 00:10 kirliavc

The MCQ problems require selecting one answer from 4 choices, so random guessing already yields about 25%; a result below 25% is meaningless.

I got it. So do you directly ignore the scores of the multiple-choice sections, or only count the parts greater than 25?

kkwhale7 avatar Oct 17 '23 02:10 kkwhale7

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow with ZeroRetriever. All results below are with llama2-7b models. [screenshots]

kkwhale7 avatar Oct 17 '23 03:10 kkwhale7

We will review this problem; more information and logs will be provided later.

tonysy avatar Oct 17 '23 03:10 tonysy

Detailed scores can be found here: https://opencompass.org.cn/dataset-detail/GAOKAO-Bench

The average score is weighted by the total score of each individual subject. We do NOT ignore scores below 25.0!
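For illustration, a minimal sketch (hypothetical numbers, not the actual OpenCompass code) of a weighted average where each subject's weight is its total score:

```python
# Hypothetical subject results: (points earned by the model, total points available).
subjects = {
    "subject_A_MCQs": (10.0, 40.0),
    "subject_B_MCQs": (5.0, 30.0),
    "subject_C_MCQs": (20.0, 50.0),
}

# Weighting each subject's accuracy by its total score reduces to
# sum(earned) / sum(total); no subject is dropped, even if it is below 25%.
earned = sum(score for score, _ in subjects.values())
total = sum(full for _, full in subjects.values())
print(f"weighted average: {100 * earned / total:.2f}")  # 35 / 120 -> 29.17
```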

As for llama-2-7b's failure to follow the instructions, we think this is totally understandable. We implement the postprocessing here: https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/GaokaoBench.py. The final result depends on the output of this postprocessing.
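For illustration, a simplified sketch of that kind of postprocessing (scan the generation from the end and keep the last standalone option letter); the real implementation is in GaokaoBench.py and may differ in its patterns and per-question-type handling:

```python
import re

def extract_last_option(prediction: str) -> str:
    """Return the last standalone A/B/C/D in the prediction, or '' if none is found.

    Simplified sketch only; the actual OpenCompass postprocessing in
    GaokaoBench.py may use different regexes and question-type logic.
    """
    matches = re.findall(r"\b([ABCD])\b", prediction)
    return matches[-1] if matches else ""

print(extract_last_option("... so the answer is C."))        # -> "C"
print(extract_last_option("I am not sure about this one."))  # -> ""
```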

Leymore avatar Oct 17 '23 10:10 Leymore

Thank you for your patience. I understand your score calculation method now. But why are the two predictions different when I use the same config? This postprocessing method only takes the first A/B/C/D character found when scanning the prediction from back to front, yet the prediction is still inconsistent with the other test: https://github.com/open-compass/opencompass/issues/480#issuecomment-1765588411

kkwhale7 avatar Oct 17 '23 11:10 kkwhale7

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow with ZeroRetriever. All results below are with llama2-7b models. [screenshots] So why are the two prediction results inconsistent when using the gen approach?

kkwhale7 avatar Oct 20 '23 07:10 kkwhale7

@kkwhale7 Hey, does that still exist?

tonysy avatar Oct 30 '23 14:10 tonysy

@kkwhale7 Hey, does that still exist?

Yes. When I only calculate the average score over the subjects you specified, Baichuan2-7b-base scores only 17.33 on GAOKAO, while it is reported as 34.8 on your official website. [screenshots] I can't reproduce it with your latest version.

kkwhale7 avatar Nov 03 '23 02:11 kkwhale7