
Missing dataset: tool_bench

dgzxx-2000 opened this issue 7 months ago

Self-check checklist

Before submitting an issue, please make sure you have completed the following steps:

Problem description

Exception: Unknown benchmark: tool_bench. Available tasks: ['drop', 'humaneval', 'mmlu_redux', 'hpdv2', 'genai_bench', 'evalmuse', 'general_t2i', 'tifa160', 'gsm8k', 'general_qa', 'gpqa', 'competition_math', 'arena_hard', 'super_gpqa', 'simple_qa', 'mmlu', 'winogrande', 'cmmlu', 'live_code_bench', 'math_500', 'maritime_bench', 'trivia_qa', 'bbh', 'ceval', 'process_bench', 'hellaswag', 'race', 'truthful_qa', 'iquiz', 'ifeval', 'chinese_simpleqa', 'musr', 'alpaca_eval', 'arc', 'general_mcq', 'data_collection', 'mmlu_pro', 'aime25', 'aime24']
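
For reference, the task list in the exception is read from EvalScope's benchmark registry. A minimal sketch for inspecting that registry directly, assuming BENCHMARK_MAPPINGS is still a module-level mapping in evalscope/benchmarks/benchmark.py (as the traceback below suggests; the import path may differ across versions):

from evalscope.benchmarks.benchmark import BENCHMARK_MAPPINGS

# Print every benchmark name registered in this installation.
# 'tool_bench' is absent from this list in v2.0.0, which is what
# raises the "Unknown benchmark" exception above.
print(sorted(BENCHMARK_MAPPINGS.keys()))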

EvalScope version (required)

v2.0.0

Tools used

  • [ ] Native framework
  • [ ] OpenCompass backend
  • [ ] VLMEvalKit backend
  • [ ] RAGEval backend
  • [ ] Perf / model inference stress-testing tool
  • [ ] Arena mode

Executed code or commands

(evalscope) zengxiangxi@xianshitest3day3:~/project/lvyouLLM/Fine_tuning/evl$ python evl_tool.py

Contents of evl_tool.py:

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='/data/home/zengxiangxi/project/lvyouLLM/Fine_tuning/merged_qwen3_0.6b_lora_tourism3',
    datasets=['tool_bench'],
    limit=5,
    eval_batch_size=5,
    generation_config={
        'max_new_tokens': 512,  # Maximum number of tokens to generate; set high to avoid truncation
        'temperature': 0.7,     # Sampling temperature (value recommended by Qwen)
        'top_p': 0.8,           # Top-p sampling (value recommended by Qwen)
        'top_k': 20,            # Top-k sampling (value recommended by Qwen)
        'chat_template_kwargs': {'enable_thinking': False}  # Disable thinking mode
    }
)

run_task(task_cfg=task_cfg)

Error log

(evalscope) zengxiangxi@xianshitest3day3:~/project/lvyouLLM/Fine_tuning/evl$ python evl_tool.py
2025-05-26 11:16:36,720 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
2025-05-26 11:16:38,700 - evalscope - INFO - Loading model /data/home/zengxiangxi/project/lvyouLLM/Fine_tuning/merged_qwen3_0.6b_lora_tourism3 ...
Traceback (most recent call last):
  File "/data/home/zengxiangxi/project/lvyouLLM/Fine_tuning/evl/evl_tool.py", line 17, in <module>
    run_task(task_cfg=task_cfg)
  File "/data/home/zengxiangxi/app/EvalScope/evalscope/evalscope/run.py", line 31, in run_task
    return run_single_task(task_cfg, run_time)
  File "/data/home/zengxiangxi/app/EvalScope/evalscope/evalscope/run.py", line 44, in run_single_task
    result = evaluate_model(task_cfg, outputs)
  File "/data/home/zengxiangxi/app/EvalScope/evalscope/evalscope/run.py", line 118, in evaluate_model
    evaluator = create_evaluator(task_cfg, dataset_name, outputs, base_model)
  File "/data/home/zengxiangxi/app/EvalScope/evalscope/evalscope/run.py", line 148, in create_evaluator
    benchmark: BenchmarkMeta = Benchmark.get(dataset_name)
  File "/data/home/zengxiangxi/app/EvalScope/evalscope/evalscope/benchmarks/benchmark.py", line 65, in get
    raise Exception(f'Unknown benchmark: {name}. Available tasks: {list(BENCHMARK_MAPPINGS.keys())}')
Exception: Unknown benchmark: tool_bench. Available tasks: ['drop', 'humaneval', 'mmlu_redux', 'hpdv2', 'genai_bench', 'evalmuse', 'general_t2i', 'tifa160', 'gsm8k', 'general_qa', 'gpqa', 'competition_math', 'arena_hard', 'super_gpqa', 'simple_qa', 'mmlu', 'winogrande', 'cmmlu', 'live_code_bench', 'math_500', 'maritime_bench', 'trivia_qa', 'bbh', 'ceval', 'process_bench', 'hellaswag', 'race', 'truthful_qa', 'iquiz', 'ifeval', 'chinese_simpleqa', 'musr', 'alpaca_eval', 'arc', 'general_mcq', 'data_collection', 'mmlu_pro', 'aime25', 'aime24']
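
A minimal sanity-check sketch: the same TaskConfig should succeed when the dataset is one of the registered names from the error message (e.g. 'gsm8k'), which would isolate the failure to the 'tool_bench' name rather than the model or the rest of the config:

from evalscope import TaskConfig, run_task

# Hypothetical smoke test, not from the original report: identical setup,
# but with a dataset name that does appear in the "Available tasks" list.
task_cfg = TaskConfig(
    model='/data/home/zengxiangxi/project/lvyouLLM/Fine_tuning/merged_qwen3_0.6b_lora_tourism3',
    datasets=['gsm8k'],  # any entry from the registered list should resolve
    limit=5,
)
run_task(task_cfg=task_cfg)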

Runtime environment

  • Operating system:
  • Python version: 3.10

Additional information

If there is any other relevant information, please provide it here.

dgzxx-2000 · May 26 '25 03:05