
Can't reproduce the Chinese FineTasks results.

Open bg51717 opened this issue 3 months ago • 0 comments

Thank you for your work; it helps fill an important gap in multilingual model evaluation. I evaluated Qwen/Qwen2.5-3B using the following command:

  lighteval accelerate \
      --model_args "vllm,pretrained=Qwen/Qwen2.5-3B,pairwise_tokenization=True,dtype=bfloat16,gpu_memory_utilisation=0.8" \
      --custom_tasks lighteval.tasks.multilingual.tasks \
      --tasks 'eval/zh.txt' \
      --max_samples '1000' \
      --output_dir "eval_results/"
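For completeness, eval/zh.txt is a plain task-list file in lighteval's usual suite|task|few_shot|truncate format, one spec per line. The entries below are my reconstruction from the result table (all tasks ran 0-shot, per the :0 suffixes), so treat the exact list as an assumption rather than the canonical FineTasks selection:

  lighteval|agieval_zho_mcf|0|0
  lighteval|belebele_zho_Hans_mcf|0|0
  lighteval|c3_zho_mcf|0|0
  lighteval|ceval_zho_mcf|0|0

continuing with the remaining *_zho tasks shown in the table below.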

The results I obtained were:

worker-0 >> |                               Task                               |Version|        Metric        |Value |   |Stderr|
worker-0 >> |------------------------------------------------------------------|------:|----------------------|-----:|---|-----:|
worker-0 >> |all                                                               |       |acc_                  |0.7052|±  |0.0504|
worker-0 >> |                                                                  |       |acc_norm_token        |0.6250|±  |0.0216|
worker-0 >> |                                                                  |       |acc_norm              |0.6250|±  |0.0216|
worker-0 >> |                                                                  |       |exact_match_zho_prefix|0.4670|±  |0.0155|
worker-0 >> |                                                                  |       |f1_zho                |0.6454|±  |0.0113|
worker-0 >> |lighteval:agieval_zho_mcf:_average:0                              |       |acc_                  |0.6443|±  |0.0260|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-biology:0                        |      0|acc_                  |0.8238|±  |0.0264|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-chemistry:0                      |      0|acc_                  |0.6146|±  |0.0341|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-chinese:0                        |      0|acc_                  |0.6626|±  |0.0302|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-geography:0                      |      0|acc_                  |0.7020|±  |0.0326|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-history:0                        |      0|acc_                  |0.8383|±  |0.0241|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-mathqa:0                         |      0|acc_                  |0.4131|±  |0.0263|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-physics:0                        |      0|acc_                  |0.6000|±  |0.0347|
worker-0 >> |lighteval:agieval_zho_mcf:jec-qa-ca:0                             |      0|acc_                  |0.6254|±  |0.0160|
worker-0 >> |lighteval:agieval_zho_mcf:jec-qa-kd:0                             |      0|acc_                  |0.6705|±  |0.0159|
worker-0 >> |lighteval:agieval_zho_mcf:logiqa-zh:0                             |      0|acc_                  |0.4931|±  |0.0196|
worker-0 >> |lighteval:belebele_zho_Hans_mcf:0                                 |      0|acc_                  |0.8167|±  |0.0129|
worker-0 >> |lighteval:c3_zho_mcf:0                                            |      0|acc_                  |0.8950|±  |0.0097|
worker-0 >> |lighteval:ceval_zho_mcf:_average:0                                |       |acc_                  |0.7154|±  |0.0827|
worker-0 >> |lighteval:ceval_zho_mcf:accountant:0                              |      0|acc_                  |0.7755|±  |0.0602|
worker-0 >> |lighteval:ceval_zho_mcf:advanced_mathematics:0                    |      0|acc_                  |0.2105|±  |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:art_studies:0                             |      0|acc_                  |0.6970|±  |0.0812|
worker-0 >> |lighteval:ceval_zho_mcf:basic_medicine:0                          |      0|acc_                  |0.7895|±  |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:business_administration:0                 |      0|acc_                  |0.7273|±  |0.0787|
worker-0 >> |lighteval:ceval_zho_mcf:chinese_language_and_literature:0         |      0|acc_                  |0.4783|±  |0.1065|
worker-0 >> |lighteval:ceval_zho_mcf:civil_servant:0                           |      0|acc_                  |0.6087|±  |0.0728|
worker-0 >> |lighteval:ceval_zho_mcf:clinical_medicine:0                       |      0|acc_                  |0.5909|±  |0.1073|
worker-0 >> |lighteval:ceval_zho_mcf:college_chemistry:0                       |      0|acc_                  |0.4583|±  |0.1039|
worker-0 >> |lighteval:ceval_zho_mcf:college_economics:0                       |      0|acc_                  |0.6000|±  |0.0667|
worker-0 >> |lighteval:ceval_zho_mcf:college_physics:0                         |      0|acc_                  |0.5789|±  |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:college_programming:0                     |      0|acc_                  |0.6486|±  |0.0796|
worker-0 >> |lighteval:ceval_zho_mcf:computer_architecture:0                   |      0|acc_                  |0.6667|±  |0.1054|
worker-0 >> |lighteval:ceval_zho_mcf:computer_network:0                        |      0|acc_                  |0.8947|±  |0.0723|
worker-0 >> |lighteval:ceval_zho_mcf:discrete_mathematics:0                    |      0|acc_                  |0.3750|±  |0.1250|
worker-0 >> |lighteval:ceval_zho_mcf:education_science:0                       |      0|acc_                  |0.8966|±  |0.0576|
worker-0 >> |lighteval:ceval_zho_mcf:electrical_engineer:0                     |      0|acc_                  |0.4054|±  |0.0818|
worker-0 >> |lighteval:ceval_zho_mcf:environmental_impact_assessment_engineer:0|      0|acc_                  |0.6129|±  |0.0889|
worker-0 >> |lighteval:ceval_zho_mcf:fire_engineer:0                           |      0|acc_                  |0.6452|±  |0.0874|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_biology:0                     |      0|acc_                  |0.7895|±  |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_chemistry:0                   |      0|acc_                  |0.6316|±  |0.1137|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_chinese:0                     |      0|acc_                  |0.5789|±  |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_geography:0                   |      0|acc_                  |0.8889|±  |0.0762|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_history:0                     |      0|acc_                  |0.8000|±  |0.0918|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_mathematics:0                 |      0|acc_                  |0.3889|±  |0.1182|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_physics:0                     |      0|acc_                  |0.7895|±  |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_politics:0                    |      0|acc_                  |0.9474|±  |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:ideological_and_moral_cultivation:0       |      0|acc_                  |0.9474|±  |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:law:0                                     |      0|acc_                  |0.6667|±  |0.0983|
worker-0 >> |lighteval:ceval_zho_mcf:legal_professional:0                      |      0|acc_                  |0.6957|±  |0.0981|
worker-0 >> |lighteval:ceval_zho_mcf:logic:0                                   |      0|acc_                  |0.5455|±  |0.1087|
worker-0 >> |lighteval:ceval_zho_mcf:mao_zedong_thought:0                      |      0|acc_                  |0.8333|±  |0.0777|
worker-0 >> |lighteval:ceval_zho_mcf:marxism:0                                 |      0|acc_                  |0.9474|±  |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:metrology_engineer:0                      |      0|acc_                  |0.8333|±  |0.0777|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_biology:0                   |      0|acc_                  |1.0000|±  |0.0000|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_chemistry:0                 |      0|acc_                  |1.0000|±  |0.0000|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_geography:0                 |      0|acc_                  |0.8182|±  |0.1220|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_history:0                   |      0|acc_                  |0.9545|±  |0.0455|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_mathematics:0               |      0|acc_                  |0.5789|±  |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_physics:0                   |      0|acc_                  |0.9474|±  |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_politics:0                  |      0|acc_                  |0.9048|±  |0.0656|
worker-0 >> |lighteval:ceval_zho_mcf:modern_chinese_history:0                  |      0|acc_                  |0.8696|±  |0.0718|
worker-0 >> |lighteval:ceval_zho_mcf:operating_system:0                        |      0|acc_                  |0.6842|±  |0.1096|
worker-0 >> |lighteval:ceval_zho_mcf:physician:0                               |      0|acc_                  |0.6735|±  |0.0677|
worker-0 >> |lighteval:ceval_zho_mcf:plant_protection:0                        |      0|acc_                  |0.7727|±  |0.0914|
worker-0 >> |lighteval:ceval_zho_mcf:probability_and_statistics:0              |      0|acc_                  |0.3333|±  |0.1143|
worker-0 >> |lighteval:ceval_zho_mcf:professional_tour_guide:0                 |      0|acc_                  |0.8621|±  |0.0652|
worker-0 >> |lighteval:ceval_zho_mcf:sports_science:0                          |      0|acc_                  |0.8421|±  |0.0859|
worker-0 >> |lighteval:ceval_zho_mcf:tax_accountant:0                          |      0|acc_                  |0.6939|±  |0.0665|
worker-0 >> |lighteval:ceval_zho_mcf:teacher_qualification:0                   |      0|acc_                  |0.8636|±  |0.0523|
worker-0 >> |lighteval:ceval_zho_mcf:urban_and_rural_planner:0                 |      0|acc_                  |0.7609|±  |0.0636|
worker-0 >> |lighteval:ceval_zho_mcf:veterinary_medicine:0                     |      0|acc_                  |0.6957|±  |0.0981|
worker-0 >> |lighteval:chinese_squad_zho:0                                     |      0|exact_match_zho_prefix|0.5590|±  |0.0157|
worker-0 >> |                                                                  |       |f1_zho                |0.5994|±  |0.0125|
worker-0 >> |lighteval:cmmlu_zho_mcf:_average:0                                |       |acc_                  |0.7079|±  |0.0341|
worker-0 >> |lighteval:cmmlu_zho_mcf:agronomy:0                                |      0|acc_                  |0.6154|±  |0.0375|
worker-0 >> |lighteval:cmmlu_zho_mcf:anatomy:0                                 |      0|acc_                  |0.7432|±  |0.0360|
worker-0 >> |lighteval:cmmlu_zho_mcf:ancient_chinese:0                         |      0|acc_                  |0.4268|±  |0.0387|
worker-0 >> |lighteval:cmmlu_zho_mcf:arts:0                                    |      0|acc_                  |0.8938|±  |0.0244|
worker-0 >> |lighteval:cmmlu_zho_mcf:astronomy:0                               |      0|acc_                  |0.4727|±  |0.0390|
worker-0 >> |lighteval:cmmlu_zho_mcf:business_ethics:0                         |      0|acc_                  |0.6603|±  |0.0328|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_civil_service_exam:0              |      0|acc_                  |0.7188|±  |0.0357|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_driving_rule:0                    |      0|acc_                  |0.9466|±  |0.0197|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_food_culture:0                    |      0|acc_                  |0.6471|±  |0.0411|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_foreign_policy:0                  |      0|acc_                  |0.7290|±  |0.0432|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_history:0                         |      0|acc_                  |0.8390|±  |0.0205|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_literature:0                      |      0|acc_                  |0.6225|±  |0.0340|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_teacher_qualification:0           |      0|acc_                  |0.8715|±  |0.0251|
worker-0 >> |lighteval:cmmlu_zho_mcf:clinical_knowledge:0                      |      0|acc_                  |0.6962|±  |0.0299|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_actuarial_science:0               |      0|acc_                  |0.3302|±  |0.0459|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_education:0                       |      0|acc_                  |0.8318|±  |0.0363|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_engineering_hydrology:0           |      0|acc_                  |0.6887|±  |0.0452|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_law:0                             |      0|acc_                  |0.6204|±  |0.0469|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_mathematics:0                     |      0|acc_                  |0.2667|±  |0.0434|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_medical_statistics:0              |      0|acc_                  |0.6415|±  |0.0468|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_medicine:0                        |      0|acc_                  |0.7985|±  |0.0243|
worker-0 >> |lighteval:cmmlu_zho_mcf:computer_science:0                        |      0|acc_                  |0.8186|±  |0.0270|
worker-0 >> |lighteval:cmmlu_zho_mcf:computer_security:0                       |      0|acc_                  |0.8480|±  |0.0275|
worker-0 >> |lighteval:cmmlu_zho_mcf:conceptual_physics:0                      |      0|acc_                  |0.8231|±  |0.0316|
worker-0 >> |lighteval:cmmlu_zho_mcf:construction_project_management:0         |      0|acc_                  |0.5899|±  |0.0419|
worker-0 >> |lighteval:cmmlu_zho_mcf:economics:0                               |      0|acc_                  |0.7358|±  |0.0351|
worker-0 >> |lighteval:cmmlu_zho_mcf:education:0                               |      0|acc_                  |0.7485|±  |0.0341|
worker-0 >> |lighteval:cmmlu_zho_mcf:electrical_engineering:0                  |      0|acc_                  |0.7965|±  |0.0308|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_chinese:0                      |      0|acc_                  |0.7421|±  |0.0276|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_commonsense:0                  |      0|acc_                  |0.7525|±  |0.0307|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_information_and_technology:0   |      0|acc_                  |0.8866|±  |0.0206|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_mathematics:0                  |      0|acc_                  |0.5174|±  |0.0330|
worker-0 >> |lighteval:cmmlu_zho_mcf:ethnology:0                               |      0|acc_                  |0.6815|±  |0.0402|
worker-0 >> |lighteval:cmmlu_zho_mcf:food_science:0                            |      0|acc_                  |0.6993|±  |0.0385|
worker-0 >> |lighteval:cmmlu_zho_mcf:genetics:0                                |      0|acc_                  |0.6250|±  |0.0366|
worker-0 >> |lighteval:cmmlu_zho_mcf:global_facts:0                            |      0|acc_                  |0.7517|±  |0.0355|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_biology:0                     |      0|acc_                  |0.7574|±  |0.0331|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_chemistry:0                   |      0|acc_                  |0.7045|±  |0.0399|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_geography:0                   |      0|acc_                  |0.7119|±  |0.0419|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_mathematics:0                 |      0|acc_                  |0.5305|±  |0.0391|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_physics:0                     |      0|acc_                  |0.6636|±  |0.0453|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_politics:0                    |      0|acc_                  |0.6503|±  |0.0400|
worker-0 >> |lighteval:cmmlu_zho_mcf:human_sexuality:0                         |      0|acc_                  |0.5794|±  |0.0442|
worker-0 >> |lighteval:cmmlu_zho_mcf:international_law:0                       |      0|acc_                  |0.5784|±  |0.0364|
worker-0 >> |lighteval:cmmlu_zho_mcf:journalism:0                              |      0|acc_                  |0.6453|±  |0.0366|
worker-0 >> |lighteval:cmmlu_zho_mcf:jurisprudence:0                           |      0|acc_                  |0.7859|±  |0.0203|
worker-0 >> |lighteval:cmmlu_zho_mcf:legal_and_moral_basis:0                   |      0|acc_                  |0.9766|±  |0.0104|
worker-0 >> |lighteval:cmmlu_zho_mcf:logical:0                                 |      0|acc_                  |0.6179|±  |0.0440|
worker-0 >> |lighteval:cmmlu_zho_mcf:machine_learning:0                        |      0|acc_                  |0.6803|±  |0.0424|
worker-0 >> |lighteval:cmmlu_zho_mcf:management:0                              |      0|acc_                  |0.8238|±  |0.0264|
worker-0 >> |lighteval:cmmlu_zho_mcf:marketing:0                               |      0|acc_                  |0.7722|±  |0.0313|
worker-0 >> |lighteval:cmmlu_zho_mcf:marxist_theory:0                          |      0|acc_                  |0.9418|±  |0.0171|
worker-0 >> |lighteval:cmmlu_zho_mcf:modern_chinese:0                          |      0|acc_                  |0.5172|±  |0.0466|
worker-0 >> |lighteval:cmmlu_zho_mcf:nutrition:0                               |      0|acc_                  |0.7103|±  |0.0378|
worker-0 >> |lighteval:cmmlu_zho_mcf:philosophy:0                              |      0|acc_                  |0.8381|±  |0.0361|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_accounting:0                 |      0|acc_                  |0.8400|±  |0.0278|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_law:0                        |      0|acc_                  |0.6730|±  |0.0324|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_medicine:0                   |      0|acc_                  |0.6516|±  |0.0246|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_psychology:0                 |      0|acc_                  |0.8405|±  |0.0241|
worker-0 >> |lighteval:cmmlu_zho_mcf:public_relations:0                        |      0|acc_                  |0.6667|±  |0.0358|
worker-0 >> |lighteval:cmmlu_zho_mcf:security_study:0                          |      0|acc_                  |0.7926|±  |0.0350|
worker-0 >> |lighteval:cmmlu_zho_mcf:sociology:0                               |      0|acc_                  |0.6903|±  |0.0308|
worker-0 >> |lighteval:cmmlu_zho_mcf:sports_science:0                          |      0|acc_                  |0.7273|±  |0.0348|
worker-0 >> |lighteval:cmmlu_zho_mcf:traditional_chinese_medicine:0            |      0|acc_                  |0.7189|±  |0.0331|
worker-0 >> |lighteval:cmmlu_zho_mcf:virology:0                                |      0|acc_                  |0.7633|±  |0.0328|
worker-0 >> |lighteval:cmmlu_zho_mcf:world_history:0                           |      0|acc_                  |0.7826|±  |0.0326|
worker-0 >> |lighteval:cmmlu_zho_mcf:world_religions:0                         |      0|acc_                  |0.7188|±  |0.0357|
worker-0 >> |lighteval:cmrc2018_zho:0                                          |      0|exact_match_zho_prefix|0.3750|±  |0.0153|
worker-0 >> |                                                                  |       |f1_zho                |0.6913|±  |0.0101|
worker-0 >> |lighteval:m3exams_zho_mcf:0                                       |      0|acc_                  |0.7871|±  |0.0157|
worker-0 >> |lighteval:mlmm_hellaswag_zho_mcf:0                                |      0|acc_                  |0.3380|±  |0.0150|
worker-0 >> |lighteval:ocnli_zho_mcf:0                                         |      0|acc_                  |0.7650|±  |0.0134|
worker-0 >> |lighteval:xcodah_zho_mcf:0                                        |      0|acc_                  |0.4767|±  |0.0289|
worker-0 >> |lighteval:xcopa_zho_mcf:0                                         |      0|acc_                  |0.8320|±  |0.0167|
worker-0 >> |lighteval:xcsqa_zho_mcf:0                                         |      0|acc_                  |0.5250|±  |0.0158|
worker-0 >> |lighteval:xstory_cloze_zho_mcf:0                                  |      0|acc_                  |0.8900|±  |0.0099|
worker-0 >> |lighteval:xwinograd_zho_mcf:0                                     |      0|acc_                  |0.6250|±  |0.0216|
worker-0 >> |                                                                  |       |acc_norm_token        |0.6250|±  |0.0216|
worker-0 >> |                                                                  |       |acc_norm              |0.6250|±  |0.0216|

These numbers differ noticeably from the results reported in the FineTasks blog post.

[Screenshot: Chinese FineTasks results table from the blog post]

Could you clarify exactly how the reported scores are computed so I can verify my setup?
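For reference, this is how I am currently averaging the per-task accuracies. It is a minimal sketch assuming lighteval's default output layout on my machine (a results_*.json under output_dir/results/<model> with a top-level "results" dict, and per-task accuracy stored under the "acc_" key as in the table above); the paths and key names may need adjusting:

  import json
  from pathlib import Path

  # Minimal aggregation sketch. Assumptions (may differ on your setup):
  # - lighteval wrote a results_*.json under eval_results/results/Qwen/Qwen2.5-3B
  # - the file has a top-level "results" dict mapping task names to metric dicts
  # - per-task accuracy is stored under the "acc_" key, as in the table above
  results_file = sorted(
      Path("eval_results/results/Qwen/Qwen2.5-3B").glob("results_*.json")
  )[-1]
  results = json.loads(results_file.read_text())["results"]

  # Unweighted macro-average over leaf tasks, skipping the pre-computed
  # "all" and "_average" aggregate entries.
  accs = [
      metrics["acc_"]
      for task, metrics in results.items()
      if task != "all" and "_average" not in task and "acc_" in metrics
  ]
  print(f"macro-average acc over {len(accs)} tasks: {sum(accs) / len(accs):.4f}")

If the blog used a different aggregation (for example weighting by sample count, or a different metric per task family), that alone could explain part of the gap.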

bg51717 · Sep 17 '25 16:09