Can't reproduce the Chinese FineTasks results.
Thank you for your work—it helps fill an important gap in multilingual model evaluation. I evaluated Qwen/Qwen2.5-3B using the following command:
lighteval accelerate \
--model_args "vllm,pretrained=Qwen/Qwen2.5-3B,pairwise_tokenization=True,dtype=bfloat16,gpu_memory_utilisation=0.8" \
--custom_task lighteval.tasks.multilingual.tasks \
--tasks 'eval/zh.txt' \
--max_samples '1000' \
--output_dir "eval_results/"
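In case it helps pinpoint the discrepancy, this is roughly how I read the per-task scores back out of --output_dir afterwards. It is only a minimal sketch: the results_*.json filename pattern and the top-level "results" key are assumptions about the output layout of the lighteval version I am running, and may differ in yours.

import json
from pathlib import Path

# Assumed layout (please correct me if wrong): lighteval writes one results_*.json per run
# somewhere under --output_dir; adjust the glob if your version names files differently.
output_dir = Path("eval_results")

for results_file in sorted(output_dir.rglob("results_*.json")):
    with results_file.open() as f:
        data = json.load(f)
    # Assumed schema: a top-level "results" dict keyed by task name, mapping metric -> value.
    for task, metrics in sorted(data.get("results", {}).items()):
        print(results_file.name, task, metrics)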
The results I obtained were:
worker-0 >> | Task |Version| Metric |Value | |Stderr|
worker-0 >> |------------------------------------------------------------------|------:|----------------------|-----:|---|-----:|
worker-0 >> |all | |acc_ |0.7052|± |0.0504|
worker-0 >> | | |acc_norm_token |0.6250|± |0.0216|
worker-0 >> | | |acc_norm |0.6250|± |0.0216|
worker-0 >> | | |exact_match_zho_prefix|0.4670|± |0.0155|
worker-0 >> | | |f1_zho |0.6454|± |0.0113|
worker-0 >> |lighteval:agieval_zho_mcf:_average:0 | |acc_ |0.6443|± |0.0260|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-biology:0 | 0|acc_ |0.8238|± |0.0264|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-chemistry:0 | 0|acc_ |0.6146|± |0.0341|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-chinese:0 | 0|acc_ |0.6626|± |0.0302|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-geography:0 | 0|acc_ |0.7020|± |0.0326|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-history:0 | 0|acc_ |0.8383|± |0.0241|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-mathqa:0 | 0|acc_ |0.4131|± |0.0263|
worker-0 >> |lighteval:agieval_zho_mcf:gaokao-physics:0 | 0|acc_ |0.6000|± |0.0347|
worker-0 >> |lighteval:agieval_zho_mcf:jec-qa-ca:0 | 0|acc_ |0.6254|± |0.0160|
worker-0 >> |lighteval:agieval_zho_mcf:jec-qa-kd:0 | 0|acc_ |0.6705|± |0.0159|
worker-0 >> |lighteval:agieval_zho_mcf:logiqa-zh:0 | 0|acc_ |0.4931|± |0.0196|
worker-0 >> |lighteval:belebele_zho_Hans_mcf:0 | 0|acc_ |0.8167|± |0.0129|
worker-0 >> |lighteval:c3_zho_mcf:0 | 0|acc_ |0.8950|± |0.0097|
worker-0 >> |lighteval:ceval_zho_mcf:_average:0 | |acc_ |0.7154|± |0.0827|
worker-0 >> |lighteval:ceval_zho_mcf:accountant:0 | 0|acc_ |0.7755|± |0.0602|
worker-0 >> |lighteval:ceval_zho_mcf:advanced_mathematics:0 | 0|acc_ |0.2105|± |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:art_studies:0 | 0|acc_ |0.6970|± |0.0812|
worker-0 >> |lighteval:ceval_zho_mcf:basic_medicine:0 | 0|acc_ |0.7895|± |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:business_administration:0 | 0|acc_ |0.7273|± |0.0787|
worker-0 >> |lighteval:ceval_zho_mcf:chinese_language_and_literature:0 | 0|acc_ |0.4783|± |0.1065|
worker-0 >> |lighteval:ceval_zho_mcf:civil_servant:0 | 0|acc_ |0.6087|± |0.0728|
worker-0 >> |lighteval:ceval_zho_mcf:clinical_medicine:0 | 0|acc_ |0.5909|± |0.1073|
worker-0 >> |lighteval:ceval_zho_mcf:college_chemistry:0 | 0|acc_ |0.4583|± |0.1039|
worker-0 >> |lighteval:ceval_zho_mcf:college_economics:0 | 0|acc_ |0.6000|± |0.0667|
worker-0 >> |lighteval:ceval_zho_mcf:college_physics:0 | 0|acc_ |0.5789|± |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:college_programming:0 | 0|acc_ |0.6486|± |0.0796|
worker-0 >> |lighteval:ceval_zho_mcf:computer_architecture:0 | 0|acc_ |0.6667|± |0.1054|
worker-0 >> |lighteval:ceval_zho_mcf:computer_network:0 | 0|acc_ |0.8947|± |0.0723|
worker-0 >> |lighteval:ceval_zho_mcf:discrete_mathematics:0 | 0|acc_ |0.3750|± |0.1250|
worker-0 >> |lighteval:ceval_zho_mcf:education_science:0 | 0|acc_ |0.8966|± |0.0576|
worker-0 >> |lighteval:ceval_zho_mcf:electrical_engineer:0 | 0|acc_ |0.4054|± |0.0818|
worker-0 >> |lighteval:ceval_zho_mcf:environmental_impact_assessment_engineer:0| 0|acc_ |0.6129|± |0.0889|
worker-0 >> |lighteval:ceval_zho_mcf:fire_engineer:0 | 0|acc_ |0.6452|± |0.0874|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_biology:0 | 0|acc_ |0.7895|± |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_chemistry:0 | 0|acc_ |0.6316|± |0.1137|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_chinese:0 | 0|acc_ |0.5789|± |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_geography:0 | 0|acc_ |0.8889|± |0.0762|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_history:0 | 0|acc_ |0.8000|± |0.0918|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_mathematics:0 | 0|acc_ |0.3889|± |0.1182|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_physics:0 | 0|acc_ |0.7895|± |0.0961|
worker-0 >> |lighteval:ceval_zho_mcf:high_school_politics:0 | 0|acc_ |0.9474|± |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:ideological_and_moral_cultivation:0 | 0|acc_ |0.9474|± |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:law:0 | 0|acc_ |0.6667|± |0.0983|
worker-0 >> |lighteval:ceval_zho_mcf:legal_professional:0 | 0|acc_ |0.6957|± |0.0981|
worker-0 >> |lighteval:ceval_zho_mcf:logic:0 | 0|acc_ |0.5455|± |0.1087|
worker-0 >> |lighteval:ceval_zho_mcf:mao_zedong_thought:0 | 0|acc_ |0.8333|± |0.0777|
worker-0 >> |lighteval:ceval_zho_mcf:marxism:0 | 0|acc_ |0.9474|± |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:metrology_engineer:0 | 0|acc_ |0.8333|± |0.0777|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_biology:0 | 0|acc_ |1.0000|± |0.0000|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_chemistry:0 | 0|acc_ |1.0000|± |0.0000|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_geography:0 | 0|acc_ |0.8182|± |0.1220|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_history:0 | 0|acc_ |0.9545|± |0.0455|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_mathematics:0 | 0|acc_ |0.5789|± |0.1164|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_physics:0 | 0|acc_ |0.9474|± |0.0526|
worker-0 >> |lighteval:ceval_zho_mcf:middle_school_politics:0 | 0|acc_ |0.9048|± |0.0656|
worker-0 >> |lighteval:ceval_zho_mcf:modern_chinese_history:0 | 0|acc_ |0.8696|± |0.0718|
worker-0 >> |lighteval:ceval_zho_mcf:operating_system:0 | 0|acc_ |0.6842|± |0.1096|
worker-0 >> |lighteval:ceval_zho_mcf:physician:0 | 0|acc_ |0.6735|± |0.0677|
worker-0 >> |lighteval:ceval_zho_mcf:plant_protection:0 | 0|acc_ |0.7727|± |0.0914|
worker-0 >> |lighteval:ceval_zho_mcf:probability_and_statistics:0 | 0|acc_ |0.3333|± |0.1143|
worker-0 >> |lighteval:ceval_zho_mcf:professional_tour_guide:0 | 0|acc_ |0.8621|± |0.0652|
worker-0 >> |lighteval:ceval_zho_mcf:sports_science:0 | 0|acc_ |0.8421|± |0.0859|
worker-0 >> |lighteval:ceval_zho_mcf:tax_accountant:0 | 0|acc_ |0.6939|± |0.0665|
worker-0 >> |lighteval:ceval_zho_mcf:teacher_qualification:0 | 0|acc_ |0.8636|± |0.0523|
worker-0 >> |lighteval:ceval_zho_mcf:urban_and_rural_planner:0 | 0|acc_ |0.7609|± |0.0636|
worker-0 >> |lighteval:ceval_zho_mcf:veterinary_medicine:0 | 0|acc_ |0.6957|± |0.0981|
worker-0 >> |lighteval:chinese_squad_zho:0 | 0|exact_match_zho_prefix|0.5590|± |0.0157|
worker-0 >> | | |f1_zho |0.5994|± |0.0125|
worker-0 >> |lighteval:cmmlu_zho_mcf:_average:0 | |acc_ |0.7079|± |0.0341|
worker-0 >> |lighteval:cmmlu_zho_mcf:agronomy:0 | 0|acc_ |0.6154|± |0.0375|
worker-0 >> |lighteval:cmmlu_zho_mcf:anatomy:0 | 0|acc_ |0.7432|± |0.0360|
worker-0 >> |lighteval:cmmlu_zho_mcf:ancient_chinese:0 | 0|acc_ |0.4268|± |0.0387|
worker-0 >> |lighteval:cmmlu_zho_mcf:arts:0 | 0|acc_ |0.8938|± |0.0244|
worker-0 >> |lighteval:cmmlu_zho_mcf:astronomy:0 | 0|acc_ |0.4727|± |0.0390|
worker-0 >> |lighteval:cmmlu_zho_mcf:business_ethics:0 | 0|acc_ |0.6603|± |0.0328|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_civil_service_exam:0 | 0|acc_ |0.7188|± |0.0357|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_driving_rule:0 | 0|acc_ |0.9466|± |0.0197|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_food_culture:0 | 0|acc_ |0.6471|± |0.0411|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_foreign_policy:0 | 0|acc_ |0.7290|± |0.0432|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_history:0 | 0|acc_ |0.8390|± |0.0205|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_literature:0 | 0|acc_ |0.6225|± |0.0340|
worker-0 >> |lighteval:cmmlu_zho_mcf:chinese_teacher_qualification:0 | 0|acc_ |0.8715|± |0.0251|
worker-0 >> |lighteval:cmmlu_zho_mcf:clinical_knowledge:0 | 0|acc_ |0.6962|± |0.0299|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_actuarial_science:0 | 0|acc_ |0.3302|± |0.0459|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_education:0 | 0|acc_ |0.8318|± |0.0363|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_engineering_hydrology:0 | 0|acc_ |0.6887|± |0.0452|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_law:0 | 0|acc_ |0.6204|± |0.0469|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_mathematics:0 | 0|acc_ |0.2667|± |0.0434|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_medical_statistics:0 | 0|acc_ |0.6415|± |0.0468|
worker-0 >> |lighteval:cmmlu_zho_mcf:college_medicine:0 | 0|acc_ |0.7985|± |0.0243|
worker-0 >> |lighteval:cmmlu_zho_mcf:computer_science:0 | 0|acc_ |0.8186|± |0.0270|
worker-0 >> |lighteval:cmmlu_zho_mcf:computer_security:0 | 0|acc_ |0.8480|± |0.0275|
worker-0 >> |lighteval:cmmlu_zho_mcf:conceptual_physics:0 | 0|acc_ |0.8231|± |0.0316|
worker-0 >> |lighteval:cmmlu_zho_mcf:construction_project_management:0 | 0|acc_ |0.5899|± |0.0419|
worker-0 >> |lighteval:cmmlu_zho_mcf:economics:0 | 0|acc_ |0.7358|± |0.0351|
worker-0 >> |lighteval:cmmlu_zho_mcf:education:0 | 0|acc_ |0.7485|± |0.0341|
worker-0 >> |lighteval:cmmlu_zho_mcf:electrical_engineering:0 | 0|acc_ |0.7965|± |0.0308|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_chinese:0 | 0|acc_ |0.7421|± |0.0276|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_commonsense:0 | 0|acc_ |0.7525|± |0.0307|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_information_and_technology:0 | 0|acc_ |0.8866|± |0.0206|
worker-0 >> |lighteval:cmmlu_zho_mcf:elementary_mathematics:0 | 0|acc_ |0.5174|± |0.0330|
worker-0 >> |lighteval:cmmlu_zho_mcf:ethnology:0 | 0|acc_ |0.6815|± |0.0402|
worker-0 >> |lighteval:cmmlu_zho_mcf:food_science:0 | 0|acc_ |0.6993|± |0.0385|
worker-0 >> |lighteval:cmmlu_zho_mcf:genetics:0 | 0|acc_ |0.6250|± |0.0366|
worker-0 >> |lighteval:cmmlu_zho_mcf:global_facts:0 | 0|acc_ |0.7517|± |0.0355|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_biology:0 | 0|acc_ |0.7574|± |0.0331|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_chemistry:0 | 0|acc_ |0.7045|± |0.0399|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_geography:0 | 0|acc_ |0.7119|± |0.0419|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_mathematics:0 | 0|acc_ |0.5305|± |0.0391|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_physics:0 | 0|acc_ |0.6636|± |0.0453|
worker-0 >> |lighteval:cmmlu_zho_mcf:high_school_politics:0 | 0|acc_ |0.6503|± |0.0400|
worker-0 >> |lighteval:cmmlu_zho_mcf:human_sexuality:0 | 0|acc_ |0.5794|± |0.0442|
worker-0 >> |lighteval:cmmlu_zho_mcf:international_law:0 | 0|acc_ |0.5784|± |0.0364|
worker-0 >> |lighteval:cmmlu_zho_mcf:journalism:0 | 0|acc_ |0.6453|± |0.0366|
worker-0 >> |lighteval:cmmlu_zho_mcf:jurisprudence:0 | 0|acc_ |0.7859|± |0.0203|
worker-0 >> |lighteval:cmmlu_zho_mcf:legal_and_moral_basis:0 | 0|acc_ |0.9766|± |0.0104|
worker-0 >> |lighteval:cmmlu_zho_mcf:logical:0 | 0|acc_ |0.6179|± |0.0440|
worker-0 >> |lighteval:cmmlu_zho_mcf:machine_learning:0 | 0|acc_ |0.6803|± |0.0424|
worker-0 >> |lighteval:cmmlu_zho_mcf:management:0 | 0|acc_ |0.8238|± |0.0264|
worker-0 >> |lighteval:cmmlu_zho_mcf:marketing:0 | 0|acc_ |0.7722|± |0.0313|
worker-0 >> |lighteval:cmmlu_zho_mcf:marxist_theory:0 | 0|acc_ |0.9418|± |0.0171|
worker-0 >> |lighteval:cmmlu_zho_mcf:modern_chinese:0 | 0|acc_ |0.5172|± |0.0466|
worker-0 >> |lighteval:cmmlu_zho_mcf:nutrition:0 | 0|acc_ |0.7103|± |0.0378|
worker-0 >> |lighteval:cmmlu_zho_mcf:philosophy:0 | 0|acc_ |0.8381|± |0.0361|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_accounting:0 | 0|acc_ |0.8400|± |0.0278|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_law:0 | 0|acc_ |0.6730|± |0.0324|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_medicine:0 | 0|acc_ |0.6516|± |0.0246|
worker-0 >> |lighteval:cmmlu_zho_mcf:professional_psychology:0 | 0|acc_ |0.8405|± |0.0241|
worker-0 >> |lighteval:cmmlu_zho_mcf:public_relations:0 | 0|acc_ |0.6667|± |0.0358|
worker-0 >> |lighteval:cmmlu_zho_mcf:security_study:0 | 0|acc_ |0.7926|± |0.0350|
worker-0 >> |lighteval:cmmlu_zho_mcf:sociology:0 | 0|acc_ |0.6903|± |0.0308|
worker-0 >> |lighteval:cmmlu_zho_mcf:sports_science:0 | 0|acc_ |0.7273|± |0.0348|
worker-0 >> |lighteval:cmmlu_zho_mcf:traditional_chinese_medicine:0 | 0|acc_ |0.7189|± |0.0331|
worker-0 >> |lighteval:cmmlu_zho_mcf:virology:0 | 0|acc_ |0.7633|± |0.0328|
worker-0 >> |lighteval:cmmlu_zho_mcf:world_history:0 | 0|acc_ |0.7826|± |0.0326|
worker-0 >> |lighteval:cmmlu_zho_mcf:world_religions:0 | 0|acc_ |0.7188|± |0.0357|
worker-0 >> |lighteval:cmrc2018_zho:0 | 0|exact_match_zho_prefix|0.3750|± |0.0153|
worker-0 >> | | |f1_zho |0.6913|± |0.0101|
worker-0 >> |lighteval:m3exams_zho_mcf:0 | 0|acc_ |0.7871|± |0.0157|
worker-0 >> |lighteval:mlmm_hellaswag_zho_mcf:0 | 0|acc_ |0.3380|± |0.0150|
worker-0 >> |lighteval:ocnli_zho_mcf:0 | 0|acc_ |0.7650|± |0.0134|
worker-0 >> |lighteval:xcodah_zho_mcf:0 | 0|acc_ |0.4767|± |0.0289|
worker-0 >> |lighteval:xcopa_zho_mcf:0 | 0|acc_ |0.8320|± |0.0167|
worker-0 >> |lighteval:xcsqa_zho_mcf:0 | 0|acc_ |0.5250|± |0.0158|
worker-0 >> |lighteval:xstory_cloze_zho_mcf:0 | 0|acc_ |0.8900|± |0.0099|
worker-0 >> |lighteval:xwinograd_zho_mcf:0 | 0|acc_ |0.6250|± |0.0216|
worker-0 >> | | |acc_norm_token |0.6250|± |0.0216|
worker-0 >> | | |acc_norm |0.6250|± |0.0216|
These numbers differ somewhat from the results reported in your blog post.
Could you let me know exactly how the reported scores are computed and aggregated, so I can verify my setup?
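To make the question concrete, below is one plausible way to collapse the per-task accuracies above into a single Chinese score (a plain macro-average over the MCF tasks, using the values from the table). I am not sure this matches the aggregation used in the blog, e.g. a rescaled or rank-based score, which is exactly what I would like to confirm.

# A plain macro-average over task-level accuracies taken from the table above.
# This is an assumption about the comparison, not the FineTasks aggregation itself.
task_scores = {
    "agieval_zho_mcf": 0.6443,
    "belebele_zho_Hans_mcf": 0.8167,
    "c3_zho_mcf": 0.8950,
    "ceval_zho_mcf": 0.7154,
    "cmmlu_zho_mcf": 0.7079,
    "m3exams_zho_mcf": 0.7871,
    "mlmm_hellaswag_zho_mcf": 0.3380,
    "ocnli_zho_mcf": 0.7650,
    "xcodah_zho_mcf": 0.4767,
    "xcopa_zho_mcf": 0.8320,
    "xcsqa_zho_mcf": 0.5250,
    "xstory_cloze_zho_mcf": 0.8900,
    "xwinograd_zho_mcf": 0.6250,
}
macro_avg = sum(task_scores.values()) / len(task_scores)
print(f"macro-average accuracy over MCF tasks: {macro_avg:.4f}")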