Yi
Result of Yi-6B-Chat on the BBH dataset cannot be reproduced
Reminder
- [X] I have searched the Github Discussion and issues and have not found anything similar to this.
Motivation
We evaluated Yi-6B-Chat on BBH and obtained the following per-task accuracies:

| Task | Accuracy |
| --- | --- |
| temporal_sequences | 0.2200 |
| disambiguation_qa | 0.2880 |
| date_understanding | 0.3720 |
| tracking_shuffled_objects_three_objects | 0.3520 |
| penguins_in_a_table | 0.3493 |
| geometric_shapes | 0.1880 |
| snarks | 0.5056 |
| ruin_names | 0.2880 |
| tracking_shuffled_objects_seven_objects | 0.0920 |
| tracking_shuffled_objects_five_objects | 0.2520 |
| logical_deduction_three_objects | 0.4440 |
| hyperbaton | 0.4640 |
| logical_deduction_five_objects | 0.2720 |
| logical_deduction_seven_objects | 0.1720 |
| movie_recommendation | 0.3320 |
| salient_translation_error_detection | 0.1400 |
| reasoning_about_colored_objects | 0.3800 |
| multistep_arithmetic_two | 0.0640 |
| navigate | 0.5080 |
| dyck_languages | 0.0000 |
| word_sorting | 0.0360 |
| sports_understanding | 0.4960 |
| boolean_expressions | 0.4240 |
| object_counting | 0.3440 |
| formal_fallacies | 0.5040 |
| causal_judgement | 0.5187 |
| web_of_lies | 0.4760 |
| **TOTAL_AVERAGE** | **0.3095** |
The average score (30.95) is far below the 47.15 reported in the paper. Are there any plans to release the evaluation code so that we can reproduce the results on academic datasets?
We currently use the `eval_bbh.py` script from this repo for evaluation: [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/light-eval/src/eval_bbh.py)
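In case it helps pinpoint the discrepancy, here is a minimal sketch of the kind of generation-and-scoring loop we have in mind. It is not the light-eval code linked above and presumably not the paper's setup either; the zero-shot chat prompt, the substring-based answer matching, and the `bbh/*.json` data path are all assumptions.

```python
# Minimal sketch of scoring one BBH task with greedy decoding.
# Assumptions: zero-shot chat prompt, naive substring answer matching,
# and BBH task files in the BIG-Bench-Hard JSON format ("examples" with
# "input"/"target"). Any of these may differ from the official evaluation.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "01-ai/Yi-6B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer(question: str) -> str:
    # Wrap the question in the model's chat template so the chat-tuned
    # checkpoint sees the format it was trained on.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

def score_task(path: str) -> float:
    # Each BBH task file contains {"examples": [{"input": ..., "target": ...}]}.
    examples = json.load(open(path))["examples"]
    correct = 0
    for ex in examples:
        pred = answer(ex["input"])
        # Naive matching: count the example as correct if the target string
        # appears in the generation. A stricter extractor would change scores.
        if ex["target"].strip().lower() in pred.strip().lower():
            correct += 1
    return correct / len(examples)

print(score_task("bbh/temporal_sequences.json"))
```

Differences in prompting (few-shot chain-of-thought vs. zero-shot) and in answer extraction alone could plausibly account for a large part of the gap, which is why having the official evaluation code would help.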
Solution
No response
Alternatives
No response
Anything Else?
No response
Are you willing to submit a PR?
- [X] I'm willing to submit a PR!