Yi
Result of Yi-6B-Chat on the BBH dataset cannot be reproduced
Reminder
- [X] I have searched the Github Discussion and issues and have not found anything similar to this.
Motivation
We evaluated Yi-6B-Chat on BBH and obtained the following per-task accuracies:

| Task | Accuracy |
| --- | --- |
| temporal_sequences | 0.2200 |
| disambiguation_qa | 0.2880 |
| date_understanding | 0.3720 |
| tracking_shuffled_objects_three_objects | 0.3520 |
| penguins_in_a_table | 0.3493 |
| geometric_shapes | 0.1880 |
| snarks | 0.5056 |
| ruin_names | 0.2880 |
| tracking_shuffled_objects_seven_objects | 0.0920 |
| tracking_shuffled_objects_five_objects | 0.2520 |
| logical_deduction_three_objects | 0.4440 |
| hyperbaton | 0.4640 |
| logical_deduction_five_objects | 0.2720 |
| logical_deduction_seven_objects | 0.1720 |
| movie_recommendation | 0.3320 |
| salient_translation_error_detection | 0.1400 |
| reasoning_about_colored_objects | 0.3800 |
| multistep_arithmetic_two | 0.0640 |
| navigate | 0.5080 |
| dyck_languages | 0.0000 |
| word_sorting | 0.0360 |
| sports_understanding | 0.4960 |
| boolean_expressions | 0.4240 |
| object_counting | 0.3440 |
| formal_fallacies | 0.5040 |
| causal_judgement | 0.5187 |
| web_of_lies | 0.4760 |
| **TOTAL_AVERAGE** | **0.3095** |
The average score (30.95) is far below the 47.15 reported in the paper. Are there any plans to release the evaluation code so that we can reproduce the results on academic datasets?
We currently use the `eval_bbh.py` script from this repo for evaluation: [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/light-eval/src/eval_bbh.py)
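In case it helps pinpoint the discrepancy, here is a minimal sketch of the kind of generation-and-scoring loop we have in mind. It is not the light-eval code linked above and presumably not the paper's setup either; the zero-shot chat prompt, the substring-based answer matching, and the `bbh/*.json` data path are all assumptions.

```python
# Minimal sketch of scoring one BBH task with greedy decoding.
# Assumptions: zero-shot chat prompt, naive substring answer matching,
# and BBH task files in the BIG-Bench-Hard JSON format ("examples" with
# "input"/"target"). Any of these may differ from the official evaluation.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "01-ai/Yi-6B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer(question: str) -> str:
    # Wrap the question in the model's chat template so the chat-tuned
    # checkpoint sees the format it was trained on.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def score_task(path: str) -> float:
    # Each BBH task file contains {"examples": [{"input": ..., "target": ...}]}.
    examples = json.load(open(path))["examples"]
    correct = 0
    for ex in examples:
        pred = answer(ex["input"])
        # Naive matching: count the example as correct if the target string
        # appears in the generation. A stricter extractor would change scores.
        if ex["target"].strip().lower() in pred.strip().lower():
            correct += 1
    return correct / len(examples)

print(score_task("bbh/temporal_sequences.json"))
```

Differences in prompting (few-shot chain-of-thought vs. zero-shot) and in answer extraction alone could plausibly account for a large part of the gap, which is why having the official evaluation code would help.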
Solution
No response
Alternatives
No response
Anything Else?
No response
Are you willing to submit a PR?
- [X] I'm willing to submit a PR!