eval-scope
return _winapi.DuplicateHandle( OSError: [WinError 6] The handle is invalid.
Self-check checklist
Before submitting an issue, please make sure you have completed the following steps:
Problem description
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
new_handle = reduction.duplicate(pipe_handle,
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
return _winapi.DuplicateHandle(
OSError: [WinError 6] The handle is invalid.
Getting reviews: 69%|██████▉ | 605/880 [01:40<01:44, 2.64it/s]2025-05-24 17:35:58,787 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
EvalScope version (required)
v0.16.0
Tools used
- [x] Native / native framework
- [ ] Opencompass backend
- [ ] VLMEvalKit backend
- [ ] RAGEval backend
- [ ] Perf / model inference stress-testing tool
- [ ] Arena / arena mode
Code or commands executed
from evalscope import TaskConfig, run_task

dataset_path = r'C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl'

task_cfg = TaskConfig(
    model='Qwen3-14B-AWQ',
    api_url='http://127.0.0.1:8889/v1/chat/completions',
    eval_type='service',
    datasets=[
        'data_collection',
    ],
    dataset_args={
        'data_collection': {
            'dataset_id': dataset_path,
            'filters': {'remove_until': '</think>'}  # filter out the thinking content
        }
    },
    eval_batch_size=15,
    generation_config={
        'max_tokens': 30000,   # maximum number of generated tokens; a large value is recommended to avoid truncated output
        'temperature': 0.6,    # sampling temperature (value recommended in the Qwen report)
        'top_p': 0.95,         # top-p sampling (value recommended in the Qwen report)
        'top_k': 20,           # top-k sampling (value recommended in the Qwen report)
        'n': 1,                # number of replies generated per request
    },
    timeout=60000,  # timeout
    stream=True,    # whether to use streaming output
    # limit=2000,   # limit the evaluation to a subset of the data for testing
    use_cache=r"./outputs/20250524_152506"
)
run_task(task_cfg=task_cfg)
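
For reference, the same script with a Windows entry-point guard. This is only a sketch of a possible workaround, based on my assumption that the repeated config dumps and the handle error come from Python's "spawn" start method re-importing the main module in worker processes on Windows; it is not a confirmed fix from the EvalScope side.

from evalscope import TaskConfig, run_task

def main():
    dataset_path = r'C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl'
    task_cfg = TaskConfig(
        model='Qwen3-14B-AWQ',
        api_url='http://127.0.0.1:8889/v1/chat/completions',
        eval_type='service',
        datasets=['data_collection'],
        dataset_args={
            'data_collection': {
                'dataset_id': dataset_path,
                'filters': {'remove_until': '</think>'},
            }
        },
        eval_batch_size=15,
        generation_config={
            'max_tokens': 30000,
            'temperature': 0.6,
            'top_p': 0.95,
            'top_k': 20,
            'n': 1,
        },
        timeout=60000,
        stream=True,
        use_cache=r"./outputs/20250524_152506",
    )
    run_task(task_cfg=task_cfg)

# On Windows, multiprocessing uses the "spawn" start method, which re-imports
# the main module in every worker process. Without this guard the whole script
# (including run_task) is executed again in each child, which would match the
# repeated "Args: Task config ..." lines and the WinError 6 handle errors below.
if __name__ == '__main__':
    main()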
Error log
C:\ProgramData\anaconda3\envs\evalscope\python.exe C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\H800\autodl_h800.py
2025-05-24 17:34:00,366 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:00,381 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:04,916 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:10,057 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:15,274 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:15,276 - evalscope - INFO - {
"model": "Qwen3-14B-AWQ",
"model_id": "Qwen3-14B-AWQ",
"model_args": {},
"model_task": "text_generation",
"template_type": null,
"chat_template": null,
"datasets": [
"data_collection"
],
"dataset_args": {
"data_collection": {
"dataset_id": "C:\\Users\\Administrator\\PycharmProjects\\llmEval\\EvalScope\\dataset\\weighted_mixed_data.jsonl",
"filters": {
"remove_until": "</think>"
}
}
},
"dataset_dir": "C:\\Users\\Administrator\\.cache\\modelscope\\hub\\datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_tokens": 30000,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"n": 1
},
"eval_type": "service",
"eval_backend": "Native",
"eval_config": null,
"stage": "all",
"limit": null,
"eval_batch_size": 15,
"mem_cache": false,
"use_cache": "./outputs/20250524_152506",
"work_dir": "./outputs/20250524_152506",
"outputs": null,
"ignore_errors": false,
"debug": false,
"dry_run": false,
"seed": 42,
"api_url": "http://127.0.0.1:8889/v1/chat/completions",
"api_key": "EMPTY",
"timeout": 60000,
"stream": true,
"judge_strategy": "auto",
"judge_worker_num": 1,
"judge_model_args": {}
}
2025-05-24 17:34:16,135 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:16,136 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews: 0%| | 0/880 [00:00<?, ?it/s]2025-05-24 17:34:18,363 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:18,376 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:23,077 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:28,213 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:33,327 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:33,330 - evalscope - INFO - {
"model": "Qwen3-14B-AWQ",
"model_id": "Qwen3-14B-AWQ",
"model_args": {},
"model_task": "text_generation",
"template_type": null,
"chat_template": null,
"datasets": [
"data_collection"
],
"dataset_args": {
"data_collection": {
"dataset_id": "C:\\Users\\Administrator\\PycharmProjects\\llmEval\\EvalScope\\dataset\\weighted_mixed_data.jsonl",
"filters": {
"remove_until": "</think>"
}
}
},
"dataset_dir": "C:\\Users\\Administrator\\.cache\\modelscope\\hub\\datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_tokens": 30000,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"n": 1
},
"eval_type": "service",
"eval_backend": "Native",
"eval_config": null,
"stage": "all",
"limit": null,
"eval_batch_size": 15,
"mem_cache": false,
"use_cache": "./outputs/20250524_152506",
"work_dir": "./outputs/20250524_152506",
"outputs": null,
"ignore_errors": false,
"debug": false,
"dry_run": false,
"seed": 42,
"api_url": "http://127.0.0.1:8889/v1/chat/completions",
"api_key": "EMPTY",
"timeout": 60000,
"stream": true,
"judge_strategy": "auto",
"judge_worker_num": 1,
"judge_model_args": {}
}
2025-05-24 17:34:34,311 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:34,313 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews: 100%|██████████| 880/880 [00:01<00:00, 578.38it/s]
Getting scores: 100%|██████████| 880/880 [00:00<00:00, 381103.51it/s]
2025-05-24 17:34:35,912 - evalscope - INFO - subset_level Report:
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
| task_type | metric | dataset_name | subset_name | average_score | count |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
| code | Pass@1 | live_code_bench | v5_v6 | 0.0 | 90 |
| math | AverageAccuracy | gsm8k | main | 0.9444 | 90 |
| reasoning | AverageAccuracy | hellaswag | default | 0.7889 | 90 |
| knowledge | AveragePass@1 | gpqa | gpqa_diamond | 0.6333 | 90 |
| exam | AverageAccuracy | iquiz | EQ | 0.6866 | 67 |
| reasoning | AverageAccuracy | arc | ARC-Easy | 0.9661 | 59 |
| exam | AverageAccuracy | iquiz | IQ | 0.8182 | 33 |
| reasoning | AverageAccuracy | arc | ARC-Challenge | 1.0 | 31 |
| math | AveragePass@1 | aime24 | default | 0.8 | 30 |
| general | AverageAccuracy | mmlu_pro | physics | 0.9412 | 17 |
| math | AveragePass@1 | aime25 | AIME2025-I | 0.6 | 15 |
| math | AveragePass@1 | aime25 | AIME2025-II | 0.7333 | 15 |
| general | AverageAccuracy | mmlu_pro | economics | 1.0 | 11 |
| general | AverageAccuracy | mmlu_pro | math | 1.0 | 10 |
| general | AverageAccuracy | mmlu_pro | law | 0.625 | 8 |
| general | AverageAccuracy | mmlu_pro | other | 0.875 | 8 |
| general | AverageAccuracy | mmlu_pro | business | 0.8571 | 7 |
| general | AverageAccuracy | mmlu_pro | psychology | 0.7143 | 7 |
| general | AverageAccuracy | mmlu_pro | chemistry | 1.0 | 6 |
| general | AverageAccuracy | mmlu_redux | management | 0.8 | 5 |
| general | AverageAccuracy | mmlu_pro | health | 0.8 | 5 |
| exam | AverageAccuracy | ceval | civil_servant | 0.8 | 5 |
| exam | AverageAccuracy | ceval | tax_accountant | 0.8 | 5 |
| exam | AverageAccuracy | ceval | discrete_mathematics | 0.25 | 4 |
| exam | AverageAccuracy | ceval | college_economics | 0.75 | 4 |
| general | AverageAccuracy | mmlu_pro | biology | 0.75 | 4 |
| general | AverageAccuracy | mmlu_redux | high_school_macroeconomics | 1.0 | 4 |
| general | AverageAccuracy | mmlu_redux | jurisprudence | 0.75 | 4 |
| general | AverageAccuracy | mmlu_redux | high_school_world_history | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | high_school_mathematics | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | security_studies | 0.6667 | 3 |
| general | AverageAccuracy | mmlu_redux | professional_medicine | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | high_school_government_and_politics | 1.0 | 3 |
| general | AverageAccuracy | mmlu_pro | philosophy | 0.3333 | 3 |
| exam | AverageAccuracy | ceval | high_school_physics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | high_school_politics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | business_administration | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | teacher_qualification | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | abstract_algebra | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | astronomy | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | anatomy | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | middle_school_politics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | probability_and_statistics | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | physician | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | mao_zedong_thought | 1.0 | 2 |
| exam | AverageAccuracy | ceval | college_chemistry | 1.0 | 2 |
| exam | AverageAccuracy | ceval | metrology_engineer | 0.5 | 2 |
| exam | AverageAccuracy | ceval | environmental_impact_assessment_engineer | 1.0 | 2 |
| exam | AverageAccuracy | ceval | electrical_engineer | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_psychology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_microeconomics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | professional_accounting | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | professional_psychology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | nutrition | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | philosophy | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_statistics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_us_history | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | public_relations | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | us_foreign_policy | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_geography | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | elementary_mathematics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | college_physics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | middle_school_mathematics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | plant_protection | 1.0 | 2 |
| exam | AverageAccuracy | ceval | middle_school_biology | 1.0 | 2 |
| exam | AverageAccuracy | ceval | high_school_biology | 1.0 | 2 |
| exam | AverageAccuracy | ceval | high_school_history | 1.0 | 2 |
| exam | AverageAccuracy | ceval | legal_professional | 0.5 | 2 |
| exam | AverageAccuracy | ceval | high_school_mathematics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | chinese_language_and_literature | 1.0 | 2 |
| exam | AverageAccuracy | ceval | clinical_medicine | 0.5 | 2 |
| exam | AverageAccuracy | ceval | accountant | 1.0 | 2 |
| exam | AverageAccuracy | ceval | art_studies | 1.0 | 2 |
| exam | AverageAccuracy | ceval | computer_architecture | 1.0 | 2 |
| exam | AverageAccuracy | ceval | education_science | 1.0 | 2 |
| exam | AverageAccuracy | ceval | college_programming | 1.0 | 2 |
| general | AverageAccuracy | mmlu_pro | engineering | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | clinical_knowledge | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | econometrics | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | college_mathematics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | college_computer_science | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_biology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | formal_logic | 1.0 | 2 |
| exam | AverageAccuracy | ceval | urban_and_rural_planner | 1.0 | 1 |
| exam | AverageAccuracy | ceval | sports_science | 1.0 | 1 |
| exam | AverageAccuracy | ceval | marxism | 0.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_physics | 1.0 | 1 |
| exam | AverageAccuracy | ceval | modern_chinese_history | 0.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_history | 1.0 | 1 |
| exam | AverageAccuracy | ceval | law | 1.0 | 1 |
| exam | AverageAccuracy | ceval | ideological_and_moral_cultivation | 1.0 | 1 |
| exam | AverageAccuracy | ceval | logic | 1.0 | 1 |
| exam | AverageAccuracy | ceval | basic_medicine | 1.0 | 1 |
| exam | AverageAccuracy | ceval | advanced_mathematics | 1.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_chemistry | 1.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_geography | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | business_ethics | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | high_school_european_history | 0.0 | 1 |
| general | AverageAccuracy | mmlu_redux | high_school_chemistry | 0.0 | 1 |
| general | AverageAccuracy | mmlu_redux | electrical_engineering | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | global_facts | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | college_medicine | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | college_chemistry | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | computer_security | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | conceptual_physics | 1.0 | 1 |
| general | AverageAccuracy | mmlu_pro | history | 0.0 | 1 |
| general | AverageAccuracy | mmlu_pro | computer science | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | international_law | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | marketing | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | prehistory | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | miscellaneous | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | sociology | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | professional_law | 1.0 | 1 |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - dataset_level Report:
+-----------+-----------------+-----------------+---------------+-------+
| task_type | metric | dataset_name | average_score | count |
+-----------+-----------------+-----------------+---------------+-------+
| exam | AverageAccuracy | iquiz | 0.73 | 100 |
| code | Pass@1 | live_code_bench | 0.0 | 90 |
| exam | AverageAccuracy | ceval | 0.8333 | 90 |
| general | AverageAccuracy | mmlu_pro | 0.8556 | 90 |
| general | AverageAccuracy | mmlu_redux | 0.8667 | 90 |
| knowledge | AveragePass@1 | gpqa | 0.6333 | 90 |
| math | AverageAccuracy | gsm8k | 0.9444 | 90 |
| reasoning | AverageAccuracy | hellaswag | 0.7889 | 90 |
| reasoning | AverageAccuracy | arc | 0.9778 | 90 |
| math | AveragePass@1 | aime25 | 0.6667 | 30 |
| math | AveragePass@1 | aime24 | 0.8 | 30 |
+-----------+-----------------+-----------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - task_level Report:
+-----------+-----------------+---------------+-------+
| task_type | metric | average_score | count |
+-----------+-----------------+---------------+-------+
| exam | AverageAccuracy | 0.7789 | 190 |
| general | AverageAccuracy | 0.8611 | 180 |
| reasoning | AverageAccuracy | 0.8833 | 180 |
| code | Pass@1 | 0.0 | 90 |
| knowledge | AveragePass@1 | 0.6333 | 90 |
| math | AverageAccuracy | 0.9444 | 90 |
| math | AveragePass@1 | 0.7333 | 60 |
+-----------+-----------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - tag_level Report:
+------+-----------------+---------------+-------+
| tags | metric | average_score | count |
+------+-----------------+---------------+-------+
| en | AverageAccuracy | 0.8867 | 450 |
| zh | AverageAccuracy | 0.7789 | 190 |
| en | AveragePass@1 | 0.6733 | 150 |
| en | Pass@1 | 0.0 | 90 |
+------+-----------------+---------------+-------+
2025-05-24 17:34:35,916 - evalscope - INFO - category_level Report:
+-----------+--------------+-----------------+---------------+-------+
| category0 | category1 | metric | average_score | count |
+-----------+--------------+-----------------+---------------+-------+
| mix | Chinese | AverageAccuracy | 0.7789 | 190 |
| mix | reasoning | AverageAccuracy | 0.8833 | 180 |
| mix | general | AverageAccuracy | 0.8611 | 180 |
| mix | Math&Science | AveragePass@1 | 0.6733 | 150 |
| mix | Math&Science | AverageAccuracy | 0.9444 | 90 |
| mix | code | Pass@1 | 0.0 | 90 |
+-----------+--------------+-----------------+---------------+-------+
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
new_handle = reduction.duplicate(pipe_handle,
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
return _winapi.DuplicateHandle(
OSError: [WinError 6] The handle is invalid.
Getting reviews: 68%|██████▊ | 601/880 [00:20<00:09, 29.24it/s]2025-05-24 17:34:38,938 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:38,962 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:43,565 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:48,648 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:53,776 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:53,778 - evalscope - INFO - {
"model": "Qwen3-14B-AWQ",
"model_id": "Qwen3-14B-AWQ",
"model_args": {},
"model_task": "text_generation",
"template_type": null,
"chat_template": null,
"datasets": [
"data_collection"
],
"dataset_args": {
"data_collection": {
"dataset_id": "C:\\Users\\Administrator\\PycharmProjects\\llmEval\\EvalScope\\dataset\\weighted_mixed_data.jsonl",
"filters": {
"remove_until": "</think>"
}
}
},
"dataset_dir": "C:\\Users\\Administrator\\.cache\\modelscope\\hub\\datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_tokens": 30000,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"n": 1
},
"eval_type": "service",
"eval_backend": "Native",
"eval_config": null,
"stage": "all",
"limit": null,
"eval_batch_size": 15,
"mem_cache": false,
"use_cache": "./outputs/20250524_152506",
"work_dir": "./outputs/20250524_152506",
"outputs": null,
"ignore_errors": false,
"debug": false,
"dry_run": false,
"seed": 42,
"api_url": "http://127.0.0.1:8889/v1/chat/completions",
"api_key": "EMPTY",
"timeout": 60000,
"stream": true,
"judge_strategy": "auto",
"judge_worker_num": 1,
"judge_model_args": {}
}
2025-05-24 17:34:54,629 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:54,630 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews: 100%|██████████| 880/880 [00:00<00:00, 880483.66it/s]
Getting scores: 100%|██████████| 880/880 [00:00<00:00, 440031.89it/s]
2025-05-24 17:34:55,752 - evalscope - INFO - subset_level Report:
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
| task_type | metric | dataset_name | subset_name | average_score | count |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
| code | Pass@1 | live_code_bench | v5_v6 | 0.0 | 90 |
| math | AverageAccuracy | gsm8k | main | 0.9444 | 90 |
| reasoning | AverageAccuracy | hellaswag | default | 0.7889 | 90 |
| knowledge | AveragePass@1 | gpqa | gpqa_diamond | 0.6333 | 90 |
| exam | AverageAccuracy | iquiz | EQ | 0.6866 | 67 |
| reasoning | AverageAccuracy | arc | ARC-Easy | 0.9661 | 59 |
| exam | AverageAccuracy | iquiz | IQ | 0.8182 | 33 |
| reasoning | AverageAccuracy | arc | ARC-Challenge | 1.0 | 31 |
| math | AveragePass@1 | aime24 | default | 0.8 | 30 |
| general | AverageAccuracy | mmlu_pro | physics | 0.9412 | 17 |
| math | AveragePass@1 | aime25 | AIME2025-I | 0.6 | 15 |
| math | AveragePass@1 | aime25 | AIME2025-II | 0.7333 | 15 |
| general | AverageAccuracy | mmlu_pro | economics | 1.0 | 11 |
| general | AverageAccuracy | mmlu_pro | math | 1.0 | 10 |
| general | AverageAccuracy | mmlu_pro | law | 0.625 | 8 |
| general | AverageAccuracy | mmlu_pro | other | 0.875 | 8 |
| general | AverageAccuracy | mmlu_pro | business | 0.8571 | 7 |
| general | AverageAccuracy | mmlu_pro | psychology | 0.7143 | 7 |
| general | AverageAccuracy | mmlu_pro | chemistry | 1.0 | 6 |
| general | AverageAccuracy | mmlu_redux | management | 0.8 | 5 |
| general | AverageAccuracy | mmlu_pro | health | 0.8 | 5 |
| exam | AverageAccuracy | ceval | civil_servant | 0.8 | 5 |
| exam | AverageAccuracy | ceval | tax_accountant | 0.8 | 5 |
| exam | AverageAccuracy | ceval | discrete_mathematics | 0.25 | 4 |
| exam | AverageAccuracy | ceval | college_economics | 0.75 | 4 |
| general | AverageAccuracy | mmlu_pro | biology | 0.75 | 4 |
| general | AverageAccuracy | mmlu_redux | high_school_macroeconomics | 1.0 | 4 |
| general | AverageAccuracy | mmlu_redux | jurisprudence | 0.75 | 4 |
| general | AverageAccuracy | mmlu_redux | high_school_world_history | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | high_school_mathematics | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | security_studies | 0.6667 | 3 |
| general | AverageAccuracy | mmlu_redux | professional_medicine | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | high_school_government_and_politics | 1.0 | 3 |
| general | AverageAccuracy | mmlu_pro | philosophy | 0.3333 | 3 |
| exam | AverageAccuracy | ceval | high_school_physics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | high_school_politics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | business_administration | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | teacher_qualification | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | abstract_algebra | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | astronomy | 1.0 | 3 |
| general | AverageAccuracy | mmlu_redux | anatomy | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | middle_school_politics | 1.0 | 3 |
| exam | AverageAccuracy | ceval | probability_and_statistics | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | physician | 0.6667 | 3 |
| exam | AverageAccuracy | ceval | mao_zedong_thought | 1.0 | 2 |
| exam | AverageAccuracy | ceval | college_chemistry | 1.0 | 2 |
| exam | AverageAccuracy | ceval | metrology_engineer | 0.5 | 2 |
| exam | AverageAccuracy | ceval | environmental_impact_assessment_engineer | 1.0 | 2 |
| exam | AverageAccuracy | ceval | electrical_engineer | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_psychology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_microeconomics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | professional_accounting | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | professional_psychology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | nutrition | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | philosophy | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_statistics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_us_history | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | public_relations | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | us_foreign_policy | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_geography | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | elementary_mathematics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | college_physics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | middle_school_mathematics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | plant_protection | 1.0 | 2 |
| exam | AverageAccuracy | ceval | middle_school_biology | 1.0 | 2 |
| exam | AverageAccuracy | ceval | high_school_biology | 1.0 | 2 |
| exam | AverageAccuracy | ceval | high_school_history | 1.0 | 2 |
| exam | AverageAccuracy | ceval | legal_professional | 0.5 | 2 |
| exam | AverageAccuracy | ceval | high_school_mathematics | 1.0 | 2 |
| exam | AverageAccuracy | ceval | chinese_language_and_literature | 1.0 | 2 |
| exam | AverageAccuracy | ceval | clinical_medicine | 0.5 | 2 |
| exam | AverageAccuracy | ceval | accountant | 1.0 | 2 |
| exam | AverageAccuracy | ceval | art_studies | 1.0 | 2 |
| exam | AverageAccuracy | ceval | computer_architecture | 1.0 | 2 |
| exam | AverageAccuracy | ceval | education_science | 1.0 | 2 |
| exam | AverageAccuracy | ceval | college_programming | 1.0 | 2 |
| general | AverageAccuracy | mmlu_pro | engineering | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | clinical_knowledge | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | econometrics | 0.5 | 2 |
| general | AverageAccuracy | mmlu_redux | college_mathematics | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | college_computer_science | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | high_school_biology | 1.0 | 2 |
| general | AverageAccuracy | mmlu_redux | formal_logic | 1.0 | 2 |
| exam | AverageAccuracy | ceval | urban_and_rural_planner | 1.0 | 1 |
| exam | AverageAccuracy | ceval | sports_science | 1.0 | 1 |
| exam | AverageAccuracy | ceval | marxism | 0.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_physics | 1.0 | 1 |
| exam | AverageAccuracy | ceval | modern_chinese_history | 0.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_history | 1.0 | 1 |
| exam | AverageAccuracy | ceval | law | 1.0 | 1 |
| exam | AverageAccuracy | ceval | ideological_and_moral_cultivation | 1.0 | 1 |
| exam | AverageAccuracy | ceval | logic | 1.0 | 1 |
| exam | AverageAccuracy | ceval | basic_medicine | 1.0 | 1 |
| exam | AverageAccuracy | ceval | advanced_mathematics | 1.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_chemistry | 1.0 | 1 |
| exam | AverageAccuracy | ceval | middle_school_geography | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | business_ethics | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | high_school_european_history | 0.0 | 1 |
| general | AverageAccuracy | mmlu_redux | high_school_chemistry | 0.0 | 1 |
| general | AverageAccuracy | mmlu_redux | electrical_engineering | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | global_facts | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | college_medicine | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | college_chemistry | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | computer_security | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | conceptual_physics | 1.0 | 1 |
| general | AverageAccuracy | mmlu_pro | history | 0.0 | 1 |
| general | AverageAccuracy | mmlu_pro | computer science | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | international_law | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | marketing | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | prehistory | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | miscellaneous | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | sociology | 1.0 | 1 |
| general | AverageAccuracy | mmlu_redux | professional_law | 1.0 | 1 |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
2025-05-24 17:34:55,754 - evalscope - INFO - dataset_level Report:
+-----------+-----------------+-----------------+---------------+-------+
| task_type | metric | dataset_name | average_score | count |
+-----------+-----------------+-----------------+---------------+-------+
| exam | AverageAccuracy | iquiz | 0.73 | 100 |
| code | Pass@1 | live_code_bench | 0.0 | 90 |
| exam | AverageAccuracy | ceval | 0.8333 | 90 |
| general | AverageAccuracy | mmlu_pro | 0.8556 | 90 |
| general | AverageAccuracy | mmlu_redux | 0.8667 | 90 |
| knowledge | AveragePass@1 | gpqa | 0.6333 | 90 |
| math | AverageAccuracy | gsm8k | 0.9444 | 90 |
| reasoning | AverageAccuracy | hellaswag | 0.7889 | 90 |
| reasoning | AverageAccuracy | arc | 0.9778 | 90 |
| math | AveragePass@1 | aime25 | 0.6667 | 30 |
| math | AveragePass@1 | aime24 | 0.8 | 30 |
+-----------+-----------------+-----------------+---------------+-------+
2025-05-24 17:34:55,754 - evalscope - INFO - task_level Report:
+-----------+-----------------+---------------+-------+
| task_type | metric | average_score | count |
+-----------+-----------------+---------------+-------+
| exam | AverageAccuracy | 0.7789 | 190 |
| general | AverageAccuracy | 0.8611 | 180 |
| reasoning | AverageAccuracy | 0.8833 | 180 |
| code | Pass@1 | 0.0 | 90 |
| knowledge | AveragePass@1 | 0.6333 | 90 |
| math | AverageAccuracy | 0.9444 | 90 |
| math | AveragePass@1 | 0.7333 | 60 |
+-----------+-----------------+---------------+-------+
2025-05-24 17:34:55,754 - evalscope - INFO - tag_level Report:
+------+-----------------+---------------+-------+
| tags | metric | average_score | count |
+------+-----------------+---------------+-------+
| en | AverageAccuracy | 0.8867 | 450 |
| zh | AverageAccuracy | 0.7789 | 190 |
| en | AveragePass@1 | 0.6733 | 150 |
| en | Pass@1 | 0.0 | 90 |
+------+-----------------+---------------+-------+
2025-05-24 17:34:55,755 - evalscope - INFO - category_level Report:
+-----------+--------------+-----------------+---------------+-------+
| category0 | category1 | metric | average_score | count |
+-----------+--------------+-----------------+---------------+-------+
| mix | Chinese | AverageAccuracy | 0.7789 | 190 |
| mix | reasoning | AverageAccuracy | 0.8833 | 180 |
| mix | general | AverageAccuracy | 0.8611 | 180 |
| mix | Math&Science | AveragePass@1 | 0.6733 | 150 |
| mix | Math&Science | AverageAccuracy | 0.9444 | 90 |
| mix | code | Pass@1 | 0.0 | 90 |
+-----------+--------------+-----------------+---------------+-------+
Getting reviews: 68%|██████▊ | 601/880 [00:39<00:09, 29.24it/s]Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
new_handle = reduction.duplicate(pipe_handle,
File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
return _winapi.DuplicateHandle(
OSError: [WinError 6] The handle is invalid.
Getting reviews: 68%|██████▊ | 602/880 [00:40<00:22, 12.27it/s]2025-05-24 17:34:58,919 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
After this point it keeps looping: the report is printed again and again, together with the error OSError: [WinError 6] The handle is invalid.
Runtime environment
- OS: Windows 11 26100.4061
- Python version: Python 3.10.16
Additional information
Without use_cache, after "Getting answers" completes, the "Getting answers" phase also keeps repeating.
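
A small diagnostic sketch (my own assumption about the cause, not something confirmed by the logs above): adding these lines at the top of autodl_h800.py prints the current process name at import time, which shows whether the script is being re-executed by spawned child processes; that would explain both the repeated report/OSError loop and the repeating "Getting answers" phase.

import multiprocessing

# This runs at import time. If Windows' "spawn" start method re-imports the
# script in a worker process, the name printed here will be something like
# "SpawnProcess-1" instead of "MainProcess".
print(f'[probe] module imported in process: {multiprocessing.current_process().name}')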