
Abnormally low MMLU accuracy results for the DeepSeek model

Open shawn9977 opened this issue 6 months ago • 1 comment

Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b19 or intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
Model: DeepSeek-R1-Distill-Qwen-32B, SYM_INT4 quantization
Tool: Lighteval
Dataset: MMLU

After benchmarking, the accuracy is only 27.67%, which is abnormally low. For comparison, the same DeepSeek-R1-Distill-Qwen-32B INT4 model scores 78.82% when benchmarked on an NVIDIA A100.
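For context, MMLU is a four-option multiple-choice benchmark, so random guessing scores 25%. A quick sketch (plain Python; the numbers are taken from the results below) shows that the observed average is statistically indistinguishable from chance, i.e. the answers are effectively not being scored at all rather than mildly degraded by quantization:

```python
# Compare the observed MMLU average against the 25% random-guess
# baseline of a 4-choice benchmark (numbers from the results table).
observed_acc = 0.2767
stderr = 0.0332
chance = 0.25  # 1 out of 4 choices

z = (observed_acc - chance) / stderr
print(f"z-score vs. chance: {z:.2f}")  # ~0.80, within one stderr of random guessing
```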

(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 24x across cluster]
(WrapperWithLoadBit pid=10769) -----> current rank: 3, world size: 4, byte_count: 15360000, is_p2p:1 [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) WARNING 06-13 12:30:19 [_logger.py:68] Pin memory is not supported on XPU. [repeated 2x across cluster]
[2025-06-13 15:38:57,787] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:498)
[2025-06-13 15:38:58,608] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:540)

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| all | | acc | 0.2767 | ± 0.0332 |
| original:mmlu:_average:0 | | acc | 0.2767 | ± 0.0332 |
| original:mmlu:abstract_algebra:0 | 0 | acc | 0.2200 | ± 0.0416 |
| original:mmlu:anatomy:0 | 0 | acc | 0.2370 | ± 0.0367 |
| original:mmlu:astronomy:0 | 0 | acc | 0.2500 | ± 0.0352 |
| original:mmlu:business_ethics:0 | 0 | acc | 0.3800 | ± 0.0488 |
| original:mmlu:clinical_knowledge:0 | 0 | acc | 0.2340 | ± 0.0261 |
| original:mmlu:college_biology:0 | 0 | acc | 0.3125 | ± 0.0388 |
| original:mmlu:college_chemistry:0 | 0 | acc | 0.2000 | ± 0.0402 |
| original:mmlu:college_computer_science:0 | 0 | acc | 0.2700 | ± 0.0446 |
| original:mmlu:college_mathematics:0 | 0 | acc | 0.2100 | ± 0.0409 |
| original:mmlu:college_medicine:0 | 0 | acc | 0.2254 | ± 0.0319 |
| original:mmlu:college_physics:0 | 0 | acc | 0.2157 | ± 0.0409 |
| original:mmlu:computer_security:0 | 0 | acc | 0.3300 | ± 0.0473 |
| original:mmlu:conceptual_physics:0 | 0 | acc | 0.3064 | ± 0.0301 |
| original:mmlu:econometrics:0 | 0 | acc | 0.2368 | ± 0.0400 |
| original:mmlu:electrical_engineering:0 | 0 | acc | 0.2759 | ± 0.0372 |
| original:mmlu:elementary_mathematics:0 | 0 | acc | 0.2249 | ± 0.0215 |
| original:mmlu:formal_logic:0 | 0 | acc | 0.2778 | ± 0.0401 |
| original:mmlu:global_facts:0 | 0 | acc | 0.2100 | ± 0.0409 |
| original:mmlu:high_school_biology:0 | 0 | acc | 0.2226 | ± 0.0237 |
| original:mmlu:high_school_chemistry:0 | 0 | acc | 0.1823 | ± 0.0272 |
| original:mmlu:high_school_computer_science:0 | 0 | acc | 0.2900 | ± 0.0456 |
| original:mmlu:high_school_european_history:0 | 0 | acc | 0.3212 | ± 0.0365 |
| original:mmlu:high_school_geography:0 | 0 | acc | 0.3030 | ± 0.0327 |
| original:mmlu:high_school_government_and_politics:0 | 0 | acc | 0.2176 | ± 0.0298 |
| original:mmlu:high_school_macroeconomics:0 | 0 | acc | 0.2538 | ± 0.0221 |
| original:mmlu:high_school_mathematics:0 | 0 | acc | 0.2111 | ± 0.0249 |
| original:mmlu:high_school_microeconomics:0 | 0 | acc | 0.2563 | ± 0.0284 |
| original:mmlu:high_school_physics:0 | 0 | acc | 0.1987 | ± 0.0326 |
| original:mmlu:high_school_psychology:0 | 0 | acc | 0.3523 | ± 0.0205 |
| original:mmlu:high_school_statistics:0 | 0 | acc | 0.1620 | ± 0.0251 |
| original:mmlu:high_school_us_history:0 | 0 | acc | 0.2990 | ± 0.0321 |
| original:mmlu:high_school_world_history:0 | 0 | acc | 0.3882 | ± 0.0317 |
| original:mmlu:human_aging:0 | 0 | acc | 0.3453 | ± 0.0319 |
| original:mmlu:human_sexuality:0 | 0 | acc | 0.3359 | ± 0.0414 |
| original:mmlu:international_law:0 | 0 | acc | 0.2893 | ± 0.0414 |
| original:mmlu:jurisprudence:0 | 0 | acc | 0.2963 | ± 0.0441 |
| original:mmlu:logical_fallacies:0 | 0 | acc | 0.3313 | ± 0.0370 |
| original:mmlu:machine_learning:0 | 0 | acc | 0.3214 | ± 0.0443 |
| original:mmlu:management:0 | 0 | acc | 0.2718 | ± 0.0441 |
| original:mmlu:marketing:0 | 0 | acc | 0.4316 | ± 0.0324 |
| original:mmlu:medical_genetics:0 | 0 | acc | 0.3000 | ± 0.0461 |
| original:mmlu:miscellaneous:0 | 0 | acc | 0.3614 | ± 0.0172 |
| original:mmlu:moral_disputes:0 | 0 | acc | 0.2919 | ± 0.0245 |
| original:mmlu:moral_scenarios:0 | 0 | acc | 0.2402 | ± 0.0143 |
| original:mmlu:nutrition:0 | 0 | acc | 0.2516 | ± 0.0248 |
| original:mmlu:philosophy:0 | 0 | acc | 0.2379 | ± 0.0242 |
| original:mmlu:prehistory:0 | 0 | acc | 0.2809 | ± 0.0250 |
| original:mmlu:professional_accounting:0 | 0 | acc | 0.2411 | ± 0.0255 |
| original:mmlu:professional_law:0 | 0 | acc | 0.2477 | ± 0.0110 |
| original:mmlu:professional_medicine:0 | 0 | acc | 0.1875 | ± 0.0237 |
| original:mmlu:professional_psychology:0 | 0 | acc | 0.3105 | ± 0.0187 |
| original:mmlu:public_relations:0 | 0 | acc | 0.2818 | ± 0.0431 |
| original:mmlu:security_studies:0 | 0 | acc | 0.2939 | ± 0.0292 |
| original:mmlu:sociology:0 | 0 | acc | 0.2985 | ± 0.0324 |
| original:mmlu:us_foreign_policy:0 | 0 | acc | 0.3200 | ± 0.0469 |
| original:mmlu:virology:0 | 0 | acc | 0.2892 | ± 0.0353 |
| original:mmlu:world_religions:0 | 0 | acc | 0.4386 | ± 0.0381 |
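As a sanity check on the Stderr column: it is consistent with the usual binomial standard error for each task's sample size. A minimal sketch for the abstract_algebra row, assuming the standard 100-question MMLU test split for that subject (the sample size is an assumption, not taken from this run's output):

```python
import math

# Binomial standard error sqrt(p * (1 - p) / (n - 1)) for the
# abstract_algebra row: acc = 0.2200 over an assumed n = 100 questions.
p, n = 0.2200, 100
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(f"{stderr:.4f}")  # 0.0416, matching the reported ± 0.0416
```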

[2025-06-13 15:38:58,686] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:530)
[2025-06-13 15:38:58,686] [ INFO]: Saving experiment tracker (evaluation_tracker.py:196)
[2025-06-13 15:39:07,447] [ INFO]: Saving results to /llm/intelmc8/shawn/project/lighteval/results/results/_llm_intelmc8_models_DeepSeek-R1-Distill-Qwen-32B/results_2025-06-13T15-38-58.686645.json (evaluation_tracker.py:265)

shawn9977 · Jun 23 '25

We’ve already synced with the user on Teams regarding this issue. After switching the evaluation framework from LightEval to EleutherAI's lm-evaluation-harness, the MMLU accuracy improved significantly. The evaluation scripts have also been shared with the user via Teams. Please feel free to reach out if any further evaluation support is needed.
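The shared scripts are not reproduced here, but for anyone else hitting this, a minimal sketch of an MMLU run via the lm-evaluation-harness Python API follows (assuming lm-eval v0.4+; the model path is a placeholder, and where the aggregate entry appears in the results dict can vary across lm-eval versions):

```python
# Minimal MMLU evaluation via EleutherAI's lm-evaluation-harness
# (pip install lm-eval). The model path below is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=/llm/models/DeepSeek-R1-Distill-Qwen-32B",  # placeholder path
    tasks=["mmlu"],
    num_fewshot=5,  # MMLU is conventionally reported 5-shot
    batch_size=8,
)
print(results["results"].get("mmlu"))  # aggregate accuracy/stderr, if present at this key
```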

liu-shaojun · Jun 25 '25