
return _winapi.DuplicateHandle( OSError: [WinError 6] 句柄无效 (The handle is invalid)

AInseven opened this issue 7 months ago · 0 comments

Self-check checklist

Before submitting this issue, please make sure you have completed the following steps:

Problem description

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
    new_handle = reduction.duplicate(pipe_handle,
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
    return _winapi.DuplicateHandle(
OSError: [WinError 6] 句柄无效。
Getting reviews:  69%|██████▉   | 605/880 [01:40<01:44,  2.64it/s]2025-05-24 17:35:58,787 - evalscope - INFO - Args: Task config is provided with TaskConfig type.

EvalScope version (required)

v0.16.0

Tools used

  • [x] Native framework
  • [ ] Opencompass backend
  • [ ] VLMEvalKit backend
  • [ ] RAGEval backend
  • [ ] Perf / model inference stress-testing tool
  • [ ] Arena mode

Code or command executed

from evalscope import TaskConfig, run_task
dataset_path = r'C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl'
task_cfg = TaskConfig(
    model='Qwen3-14B-AWQ',
    api_url='http://127.0.0.1:8889/v1/chat/completions',
    eval_type='service',
    datasets=[
        'data_collection',
    ],
    dataset_args={
        'data_collection': {
            'dataset_id': dataset_path,
            'filters': {'remove_until': '</think>'}  # strip the thinking content from the output
        }
    },
    eval_batch_size=15,
    generation_config={
        'max_tokens': 30000,  # maximum number of generated tokens; a large value is recommended to avoid truncated output
        'temperature': 0.6,  # sampling temperature (recommended in the Qwen report)
        'top_p': 0.95,  # top-p sampling (recommended in the Qwen report)
        'top_k': 20,  # top-k sampling (recommended in the Qwen report)
        'n': 1,  # number of responses per request
    },
    timeout=60000,  # request timeout
    stream=True,  # whether to use streaming output
    # limit=2000,  # cap the number of samples for a quick test run
    use_cache=r"./outputs/20250524_152506"
)

run_task(task_cfg=task_cfg)

Error log

C:\ProgramData\anaconda3\envs\evalscope\python.exe C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\H800\autodl_h800.py 
2025-05-24 17:34:00,366 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:00,381 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:04,916 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:10,057 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:15,274 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:15,276 - evalscope - INFO - {
    "model": "Qwen3-14B-AWQ",
    "model_id": "Qwen3-14B-AWQ",
    "model_args": {},
    "model_task": "text_generation",
    "template_type": null,
    "chat_template": null,
    "datasets": [
        "data_collection"
    ],
    "dataset_args": {
        "data_collection": {
            "dataset_id": "C:\\Users\\Administrator\\PycharmProjects\\llmEval\\EvalScope\\dataset\\weighted_mixed_data.jsonl",
            "filters": {
                "remove_until": "</think>"
            }
        }
    },
    "dataset_dir": "C:\\Users\\Administrator\\.cache\\modelscope\\hub\\datasets",
    "dataset_hub": "modelscope",
    "generation_config": {
        "max_tokens": 30000,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "n": 1
    },
    "eval_type": "service",
    "eval_backend": "Native",
    "eval_config": null,
    "stage": "all",
    "limit": null,
    "eval_batch_size": 15,
    "mem_cache": false,
    "use_cache": "./outputs/20250524_152506",
    "work_dir": "./outputs/20250524_152506",
    "outputs": null,
    "ignore_errors": false,
    "debug": false,
    "dry_run": false,
    "seed": 42,
    "api_url": "http://127.0.0.1:8889/v1/chat/completions",
    "api_key": "EMPTY",
    "timeout": 60000,
    "stream": true,
    "judge_strategy": "auto",
    "judge_worker_num": 1,
    "judge_model_args": {}
}
2025-05-24 17:34:16,135 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:16,136 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews:   0%|          | 0/880 [00:00<?, ?it/s]2025-05-24 17:34:18,363 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:18,376 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:23,077 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:28,213 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:33,327 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:33,330 - evalscope - INFO - [task config dump identical to the one above; omitted]
2025-05-24 17:34:34,311 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:34,313 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews: 100%|██████████| 880/880 [00:01<00:00, 578.38it/s]
Getting scores: 100%|██████████| 880/880 [00:00<00:00, 381103.51it/s]
2025-05-24 17:34:35,912 - evalscope - INFO - subset_level Report:
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
| task_type |     metric      |  dataset_name   |               subset_name                | average_score | count |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
|   code    |     Pass@1      | live_code_bench |                  v5_v6                   |      0.0      |  90   |
|   math    | AverageAccuracy |      gsm8k      |                   main                   |    0.9444     |  90   |
| reasoning | AverageAccuracy |    hellaswag    |                 default                  |    0.7889     |  90   |
| knowledge |  AveragePass@1  |      gpqa       |               gpqa_diamond               |    0.6333     |  90   |
|   exam    | AverageAccuracy |      iquiz      |                    EQ                    |    0.6866     |  67   |
| reasoning | AverageAccuracy |       arc       |                 ARC-Easy                 |    0.9661     |  59   |
|   exam    | AverageAccuracy |      iquiz      |                    IQ                    |    0.8182     |  33   |
| reasoning | AverageAccuracy |       arc       |              ARC-Challenge               |      1.0      |  31   |
|   math    |  AveragePass@1  |     aime24      |                 default                  |      0.8      |  30   |
|  general  | AverageAccuracy |    mmlu_pro     |                 physics                  |    0.9412     |  17   |
|   math    |  AveragePass@1  |     aime25      |                AIME2025-I                |      0.6      |  15   |
|   math    |  AveragePass@1  |     aime25      |               AIME2025-II                |    0.7333     |  15   |
|  general  | AverageAccuracy |    mmlu_pro     |                economics                 |      1.0      |  11   |
|  general  | AverageAccuracy |    mmlu_pro     |                   math                   |      1.0      |  10   |
|  general  | AverageAccuracy |    mmlu_pro     |                   law                    |     0.625     |   8   |
|  general  | AverageAccuracy |    mmlu_pro     |                  other                   |     0.875     |   8   |
|  general  | AverageAccuracy |    mmlu_pro     |                 business                 |    0.8571     |   7   |
|  general  | AverageAccuracy |    mmlu_pro     |                psychology                |    0.7143     |   7   |
|  general  | AverageAccuracy |    mmlu_pro     |                chemistry                 |      1.0      |   6   |
|  general  | AverageAccuracy |   mmlu_redux    |                management                |      0.8      |   5   |
|  general  | AverageAccuracy |    mmlu_pro     |                  health                  |      0.8      |   5   |
|   exam    | AverageAccuracy |      ceval      |              civil_servant               |      0.8      |   5   |
|   exam    | AverageAccuracy |      ceval      |              tax_accountant              |      0.8      |   5   |
|   exam    | AverageAccuracy |      ceval      |           discrete_mathematics           |     0.25      |   4   |
|   exam    | AverageAccuracy |      ceval      |            college_economics             |     0.75      |   4   |
|  general  | AverageAccuracy |    mmlu_pro     |                 biology                  |     0.75      |   4   |
|  general  | AverageAccuracy |   mmlu_redux    |        high_school_macroeconomics        |      1.0      |   4   |
|  general  | AverageAccuracy |   mmlu_redux    |              jurisprudence               |     0.75      |   4   |
|  general  | AverageAccuracy |   mmlu_redux    |        high_school_world_history         |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |         high_school_mathematics          |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |             security_studies             |    0.6667     |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |          professional_medicine           |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |   high_school_government_and_politics    |      1.0      |   3   |
|  general  | AverageAccuracy |    mmlu_pro     |                philosophy                |    0.3333     |   3   |
|   exam    | AverageAccuracy |      ceval      |           high_school_physics            |      1.0      |   3   |
|   exam    | AverageAccuracy |      ceval      |           high_school_politics           |      1.0      |   3   |
|   exam    | AverageAccuracy |      ceval      |         business_administration          |    0.6667     |   3   |
|   exam    | AverageAccuracy |      ceval      |          teacher_qualification           |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |             abstract_algebra             |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |                astronomy                 |      1.0      |   3   |
|  general  | AverageAccuracy |   mmlu_redux    |                 anatomy                  |    0.6667     |   3   |
|   exam    | AverageAccuracy |      ceval      |          middle_school_politics          |      1.0      |   3   |
|   exam    | AverageAccuracy |      ceval      |        probability_and_statistics        |    0.6667     |   3   |
|   exam    | AverageAccuracy |      ceval      |                physician                 |    0.6667     |   3   |
|   exam    | AverageAccuracy |      ceval      |            mao_zedong_thought            |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |            college_chemistry             |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |            metrology_engineer            |      0.5      |   2   |
|   exam    | AverageAccuracy |      ceval      | environmental_impact_assessment_engineer |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |           electrical_engineer            |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |          high_school_psychology          |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |        high_school_microeconomics        |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |         professional_accounting          |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |         professional_psychology          |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |                nutrition                 |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |                philosophy                |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |          high_school_statistics          |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |          high_school_us_history          |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |             public_relations             |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |            us_foreign_policy             |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |          high_school_geography           |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |          elementary_mathematics          |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |             college_physics              |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |        middle_school_mathematics         |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |             plant_protection             |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |          middle_school_biology           |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |           high_school_biology            |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |           high_school_history            |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |            legal_professional            |      0.5      |   2   |
|   exam    | AverageAccuracy |      ceval      |         high_school_mathematics          |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |     chinese_language_and_literature      |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |            clinical_medicine             |      0.5      |   2   |
|   exam    | AverageAccuracy |      ceval      |                accountant                |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |               art_studies                |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |          computer_architecture           |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |            education_science             |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |           college_programming            |      1.0      |   2   |
|  general  | AverageAccuracy |    mmlu_pro     |               engineering                |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |            clinical_knowledge            |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |               econometrics               |      0.5      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |           college_mathematics            |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |         college_computer_science         |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |           high_school_biology            |      1.0      |   2   |
|  general  | AverageAccuracy |   mmlu_redux    |               formal_logic               |      1.0      |   2   |
|   exam    | AverageAccuracy |      ceval      |         urban_and_rural_planner          |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |              sports_science              |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |                 marxism                  |      0.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |          middle_school_physics           |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |          modern_chinese_history          |      0.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |          middle_school_history           |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |                   law                    |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |    ideological_and_moral_cultivation     |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |                  logic                   |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |              basic_medicine              |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |           advanced_mathematics           |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |         middle_school_chemistry          |      1.0      |   1   |
|   exam    | AverageAccuracy |      ceval      |         middle_school_geography          |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |             business_ethics              |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |       high_school_european_history       |      0.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |          high_school_chemistry           |      0.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |          electrical_engineering          |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |               global_facts               |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |             college_medicine             |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |            college_chemistry             |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |            computer_security             |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |            conceptual_physics            |      1.0      |   1   |
|  general  | AverageAccuracy |    mmlu_pro     |                 history                  |      0.0      |   1   |
|  general  | AverageAccuracy |    mmlu_pro     |             computer science             |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |            international_law             |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |                marketing                 |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |                prehistory                |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |              miscellaneous               |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |                sociology                 |      1.0      |   1   |
|  general  | AverageAccuracy |   mmlu_redux    |             professional_law             |      1.0      |   1   |
+-----------+-----------------+-----------------+------------------------------------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - dataset_level Report:
+-----------+-----------------+-----------------+---------------+-------+
| task_type |     metric      |  dataset_name   | average_score | count |
+-----------+-----------------+-----------------+---------------+-------+
|   exam    | AverageAccuracy |      iquiz      |     0.73      |  100  |
|   code    |     Pass@1      | live_code_bench |      0.0      |  90   |
|   exam    | AverageAccuracy |      ceval      |    0.8333     |  90   |
|  general  | AverageAccuracy |    mmlu_pro     |    0.8556     |  90   |
|  general  | AverageAccuracy |   mmlu_redux    |    0.8667     |  90   |
| knowledge |  AveragePass@1  |      gpqa       |    0.6333     |  90   |
|   math    | AverageAccuracy |      gsm8k      |    0.9444     |  90   |
| reasoning | AverageAccuracy |    hellaswag    |    0.7889     |  90   |
| reasoning | AverageAccuracy |       arc       |    0.9778     |  90   |
|   math    |  AveragePass@1  |     aime25      |    0.6667     |  30   |
|   math    |  AveragePass@1  |     aime24      |      0.8      |  30   |
+-----------+-----------------+-----------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - task_level Report:
+-----------+-----------------+---------------+-------+
| task_type |     metric      | average_score | count |
+-----------+-----------------+---------------+-------+
|   exam    | AverageAccuracy |    0.7789     |  190  |
|  general  | AverageAccuracy |    0.8611     |  180  |
| reasoning | AverageAccuracy |    0.8833     |  180  |
|   code    |     Pass@1      |      0.0      |  90   |
| knowledge |  AveragePass@1  |    0.6333     |  90   |
|   math    | AverageAccuracy |    0.9444     |  90   |
|   math    |  AveragePass@1  |    0.7333     |  60   |
+-----------+-----------------+---------------+-------+
2025-05-24 17:34:35,915 - evalscope - INFO - tag_level Report:
+------+-----------------+---------------+-------+
| tags |     metric      | average_score | count |
+------+-----------------+---------------+-------+
|  en  | AverageAccuracy |    0.8867     |  450  |
|  zh  | AverageAccuracy |    0.7789     |  190  |
|  en  |  AveragePass@1  |    0.6733     |  150  |
|  en  |     Pass@1      |      0.0      |  90   |
+------+-----------------+---------------+-------+
2025-05-24 17:34:35,916 - evalscope - INFO - category_level Report:
+-----------+--------------+-----------------+---------------+-------+
| category0 |  category1   |     metric      | average_score | count |
+-----------+--------------+-----------------+---------------+-------+
|    mix    |   Chinese    | AverageAccuracy |    0.7789     |  190  |
|    mix    |  reasoning   | AverageAccuracy |    0.8833     |  180  |
|    mix    |   general    | AverageAccuracy |    0.8611     |  180  |
|    mix    | Math&Science |  AveragePass@1  |    0.6733     |  150  |
|    mix    | Math&Science | AverageAccuracy |    0.9444     |  90   |
|    mix    |     code     |     Pass@1      |      0.0      |  90   |
+-----------+--------------+-----------------+---------------+-------+
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
    new_handle = reduction.duplicate(pipe_handle,
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
    return _winapi.DuplicateHandle(
OSError: [WinError 6] 句柄无效。
Getting reviews:  68%|██████▊   | 601/880 [00:20<00:09, 29.24it/s]2025-05-24 17:34:38,938 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-24 17:34:38,962 - evalscope - INFO - Set resume from ./outputs/20250524_152506
2025-05-24 17:34:43,565 - evalscope - INFO - Loading dataset from C:\Users\Administrator\PycharmProjects\llmEval\EvalScope\dataset\weighted_mixed_data.jsonl
2025-05-24 17:34:48,648 - evalscope - WARNING - Output type is set to logits. This is not supported for service evaluation. Setting output type to generation by default.
2025-05-24 17:34:53,776 - evalscope - INFO - Dump task config to ./outputs/20250524_152506\configs\task_config_7fd2da.yaml
2025-05-24 17:34:53,778 - evalscope - INFO - [task config dump identical to the one above; omitted]
2025-05-24 17:34:54,629 - evalscope - INFO - Reuse from ./outputs/20250524_152506\predictions\Qwen3-14B-AWQ\weighted_mixed_data.jsonl. Loaded 880 samples, remain 0 samples.
Getting answers: 0it [00:00, ?it/s]
2025-05-24 17:34:54,630 - evalscope - INFO - use_cache=./outputs/20250524_152506, reloading the review file: ./outputs/20250524_152506\reviews\Qwen3-14B-AWQ
Getting reviews: 100%|██████████| 880/880 [00:00<00:00, 880483.66it/s]
Getting scores: 100%|██████████| 880/880 [00:00<00:00, 440031.89it/s]
2025-05-24 17:34:55,752 - evalscope - INFO - [subset_level, dataset_level, task_level, tag_level and category_level reports identical to the ones above; omitted]
Getting reviews:  68%|██████▊   | 601/880 [00:39<00:09, 29.24it/s]Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\spawn.py", line 107, in spawn_main
    new_handle = reduction.duplicate(pipe_handle,
  File "C:\ProgramData\anaconda3\envs\evalscope\lib\multiprocessing\reduction.py", line 79, in duplicate
    return _winapi.DuplicateHandle(
OSError: [WinError 6] 句柄无效。
Getting reviews:  68%|██████▊   | 602/880 [00:40<00:22, 12.27it/s]2025-05-24 17:34:58,919 - evalscope - INFO - Args: Task config is provided with TaskConfig type.

After this point it keeps looping, reprinting the report and raising OSError: [WinError 6] 句柄无效 (the handle is invalid).

Runtime environment

  • Operating system: Windows 11 26100.4061
  • Python version: 3.10.16

Additional information

Without use_cache, after "Getting answers" completes, the run keeps repeating the "Getting answers" phase.
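A likely explanation, not confirmed in this report: on Windows, Python's multiprocessing uses the spawn start method, which re-imports the main module in every child process. If run_task() runs at module top level, each spawned worker re-executes the whole script, which would match the repeated config dumps and reports as well as the WinError 6 raised inside spawn_main. A minimal sketch of the usual workaround, assuming the evaluation is launched from a plain script:

from evalscope import TaskConfig, run_task

def main():
    # Same configuration as in the report, shortened here for illustration.
    task_cfg = TaskConfig(
        model='Qwen3-14B-AWQ',
        api_url='http://127.0.0.1:8889/v1/chat/completions',
        eval_type='service',
        datasets=['data_collection'],
        # ... remaining arguments as above ...
    )
    run_task(task_cfg=task_cfg)

if __name__ == '__main__':
    # Guard the entry point so that spawn-based worker processes,
    # which re-import this module on Windows, do not re-run the evaluation itself.
    main()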

AInseven · May 24 '25 09:05