Low accuracy results for the DeepSeek model on the MMLU dataset
- Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b19 or intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
- Model: DeepSeek-R1-Distill-Qwen-32B, SYM_INT4
- Tool: Lighteval
- Dataset: MMLU
After the benchmark, the accuracy is only 27.67%, which is abnormally low. The same DeepSeek-R1-Distill-Qwen-32B INT4 model benchmarked on an NVIDIA A100 reaches 78.82%.
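For context, the setup above serves the model through the ipex-llm-serving-xpu container with SYM_INT4 (symmetric 4-bit weight-only) quantization. A minimal standalone sketch of loading the same model with that quantization, assuming ipex-llm's transformers-style Python API and a placeholder model path, looks roughly like this:

```python
# Minimal sketch (not the actual serving setup): load DeepSeek-R1-Distill-Qwen-32B
# with ipex-llm SYM_INT4 weight-only quantization and run a quick sanity prompt
# on an Intel XPU. The benchmark itself was run through the
# ipex-llm-serving-xpu container, not this standalone path.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "/llm/models/DeepSeek-R1-Distill-Qwen-32B"  # placeholder path

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",   # symmetric INT4 weight quantization
    trust_remote_code=True,
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Question: What is 2 + 2?\nAnswer:", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```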
(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 24x across cluster]
(WrapperWithLoadBit pid=10769) -----> current rank: 3, world size: 4, byte_count: 15360000, is_p2p: 1 [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) WARNING 06-13 12:30:19 [_logger.py:68] Pin memory is not supported on XPU. [repeated 2x across cluster]
[2025-06-13 15:38:57,787] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:498)
[2025-06-13 15:38:58,608] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:540)
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| all | | acc | 0.2767 | ± | 0.0332 |
| original:mmlu:_average:0 | | acc | 0.2767 | ± | 0.0332 |
| original:mmlu:abstract_algebra:0 | 0 | acc | 0.2200 | ± | 0.0416 |
| original:mmlu:anatomy:0 | 0 | acc | 0.2370 | ± | 0.0367 |
| original:mmlu:astronomy:0 | 0 | acc | 0.2500 | ± | 0.0352 |
| original:mmlu:business_ethics:0 | 0 | acc | 0.3800 | ± | 0.0488 |
| original:mmlu:clinical_knowledge:0 | 0 | acc | 0.2340 | ± | 0.0261 |
| original:mmlu:college_biology:0 | 0 | acc | 0.3125 | ± | 0.0388 |
| original:mmlu:college_chemistry:0 | 0 | acc | 0.2000 | ± | 0.0402 |
| original:mmlu:college_computer_science:0 | 0 | acc | 0.2700 | ± | 0.0446 |
| original:mmlu:college_mathematics:0 | 0 | acc | 0.2100 | ± | 0.0409 |
| original:mmlu:college_medicine:0 | 0 | acc | 0.2254 | ± | 0.0319 |
| original:mmlu:college_physics:0 | 0 | acc | 0.2157 | ± | 0.0409 |
| original:mmlu:computer_security:0 | 0 | acc | 0.3300 | ± | 0.0473 |
| original:mmlu:conceptual_physics:0 | 0 | acc | 0.3064 | ± | 0.0301 |
| original:mmlu:econometrics:0 | 0 | acc | 0.2368 | ± | 0.0400 |
| original:mmlu:electrical_engineering:0 | 0 | acc | 0.2759 | ± | 0.0372 |
| original:mmlu:elementary_mathematics:0 | 0 | acc | 0.2249 | ± | 0.0215 |
| original:mmlu:formal_logic:0 | 0 | acc | 0.2778 | ± | 0.0401 |
| original:mmlu:global_facts:0 | 0 | acc | 0.2100 | ± | 0.0409 |
| original:mmlu:high_school_biology:0 | 0 | acc | 0.2226 | ± | 0.0237 |
| original:mmlu:high_school_chemistry:0 | 0 | acc | 0.1823 | ± | 0.0272 |
| original:mmlu:high_school_computer_science:0 | 0 | acc | 0.2900 | ± | 0.0456 |
| original:mmlu:high_school_european_history:0 | 0 | acc | 0.3212 | ± | 0.0365 |
| original:mmlu:high_school_geography:0 | 0 | acc | 0.3030 | ± | 0.0327 |
| original:mmlu:high_school_government_and_politics:0 | 0 | acc | 0.2176 | ± | 0.0298 |
| original:mmlu:high_school_macroeconomics:0 | 0 | acc | 0.2538 | ± | 0.0221 |
| original:mmlu:high_school_mathematics:0 | 0 | acc | 0.2111 | ± | 0.0249 |
| original:mmlu:high_school_microeconomics:0 | 0 | acc | 0.2563 | ± | 0.0284 |
| original:mmlu:high_school_physics:0 | 0 | acc | 0.1987 | ± | 0.0326 |
| original:mmlu:high_school_psychology:0 | 0 | acc | 0.3523 | ± | 0.0205 |
| original:mmlu:high_school_statistics:0 | 0 | acc | 0.1620 | ± | 0.0251 |
| original:mmlu:high_school_us_history:0 | 0 | acc | 0.2990 | ± | 0.0321 |
| original:mmlu:high_school_world_history:0 | 0 | acc | 0.3882 | ± | 0.0317 |
| original:mmlu:human_aging:0 | 0 | acc | 0.3453 | ± | 0.0319 |
| original:mmlu:human_sexuality:0 | 0 | acc | 0.3359 | ± | 0.0414 |
| original:mmlu:international_law:0 | 0 | acc | 0.2893 | ± | 0.0414 |
| original:mmlu:jurisprudence:0 | 0 | acc | 0.2963 | ± | 0.0441 |
| original:mmlu:logical_fallacies:0 | 0 | acc | 0.3313 | ± | 0.0370 |
| original:mmlu:machine_learning:0 | 0 | acc | 0.3214 | ± | 0.0443 |
| original:mmlu:management:0 | 0 | acc | 0.2718 | ± | 0.0441 |
| original:mmlu:marketing:0 | 0 | acc | 0.4316 | ± | 0.0324 |
| original:mmlu:medical_genetics:0 | 0 | acc | 0.3000 | ± | 0.0461 |
| original:mmlu:miscellaneous:0 | 0 | acc | 0.3614 | ± | 0.0172 |
| original:mmlu:moral_disputes:0 | 0 | acc | 0.2919 | ± | 0.0245 |
| original:mmlu:moral_scenarios:0 | 0 | acc | 0.2402 | ± | 0.0143 |
| original:mmlu:nutrition:0 | 0 | acc | 0.2516 | ± | 0.0248 |
| original:mmlu:philosophy:0 | 0 | acc | 0.2379 | ± | 0.0242 |
| original:mmlu:prehistory:0 | 0 | acc | 0.2809 | ± | 0.0250 |
| original:mmlu:professional_accounting:0 | 0 | acc | 0.2411 | ± | 0.0255 |
| original:mmlu:professional_law:0 | 0 | acc | 0.2477 | ± | 0.0110 |
| original:mmlu:professional_medicine:0 | 0 | acc | 0.1875 | ± | 0.0237 |
| original:mmlu:professional_psychology:0 | 0 | acc | 0.3105 | ± | 0.0187 |
| original:mmlu:public_relations:0 | 0 | acc | 0.2818 | ± | 0.0431 |
| original:mmlu:security_studies:0 | 0 | acc | 0.2939 | ± | 0.0292 |
| original:mmlu:sociology:0 | 0 | acc | 0.2985 | ± | 0.0324 |
| original:mmlu:us_foreign_policy:0 | 0 | acc | 0.3200 | ± | 0.0469 |
| original:mmlu:virology:0 | 0 | acc | 0.2892 | ± | 0.0353 |
| original:mmlu:world_religions:0 | 0 | acc | 0.4386 | ± | 0.0381 |
[2025-06-13 15:38:58,686] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:530)
[2025-06-13 15:38:58,686] [ INFO]: Saving experiment tracker (evaluation_tracker.py:196)
[2025-06-13 15:39:07,447] [ INFO]: Saving results to /llm/intelmc8/shawn/project/lighteval/results/results/_llm_intelmc8_models_DeepSeek-R1-Distill-Qwen-32B/results_2025-06-13T15-38-58.686645.json (evaluation_tracker.py:265)
We have already synced with the user on Teams regarding this issue. After switching the evaluation framework from Lighteval to EleutherAI's lm-evaluation-harness, the MMLU accuracy improved significantly. The evaluation scripts have also been shared with the user via Teams. Please feel free to reach out if any further evaluation support is needed.
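For reference, the kind of MMLU run done with the harness looks roughly like the snippet below. This is a minimal sketch assuming the lm_eval >= 0.4 Python API; the model path, few-shot count, and batch size are placeholders rather than the exact settings used in the shared scripts.

```python
# Minimal sketch of an MMLU evaluation with EleutherAI's lm-evaluation-harness
# (the `lm_eval` package, Python API). Paths and settings below are placeholders,
# not the exact configuration from the scripts shared on Teams.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace backend; an OpenAI-compatible/vLLM backend can also be used
    model_args="pretrained=/llm/models/DeepSeek-R1-Distill-Qwen-32B,dtype=float16",  # placeholder path
    tasks=["mmlu"],
    num_fewshot=5,   # MMLU is commonly reported 5-shot
    batch_size=8,
)
print(results["results"])  # per-subject and aggregate accuracy
```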