
[Bug] MTBench101 dataset multi-turn

Open wenba0 opened this issue 1 month ago • 0 comments

Prerequisites

  • [x] I have searched the existing issues and discussions but did not get the expected help.
  • [x] The bug has not been fixed in the latest version.

Issue Type

I am evaluating with an officially supported task/model/dataset.

Environment

```shell
conda create --name opencompass python=3.10
conda activate opencompass

git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```

Reproduces the problem - code/configuration sample

opencompass/configs/datasets/subjective/multiround/mtbench101_judge.py

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import ChatInferencer, GenInferencer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.datasets import MTBench101Dataset
from opencompass.summarizers import MTBench101Summarizer
from opencompass.models import HuggingFacewithChatTemplate
from opencompass.openicl.icl_evaluator import AccEvaluator

subjective_reader_cfg = dict(
    input_columns=['dialogue', 'task', 'multi_id', 'turn_id', 'system_prompt', 'prompt_template'],
    output_column='judge',
)

subjective_all_sets = [
    'mtbench101',
]
data_path = '/home/ljj/data/subjective'

cfg = dict(
    type=HuggingFacewithChatTemplate,
    abbr='qwen2.5-7b-instruct-hf',       # model abbreviation
    path='/data/Qwen2.5-7B-Instruct/',   # HuggingFace path of the model
    max_out_len=102,                     # maximum number of generated tokens
    batch_size=8,                        # batch size
    run_cfg=dict(num_gpus=1),            # number of GPUs required by this model
)

mtbench101_datasets = []

for _name in subjective_all_sets:
    subjective_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            template="""{dialogue}""",
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=ChatInferencer, infer_mode='last'),
    )

    subjective_eval_cfg = dict(
        evaluator=dict(type=AccEvaluator),
        pred_role='BOT',
    )

    mtbench101_datasets.append(
        dict(
            abbr=f'{_name}',
            type=MTBench101Dataset,
            path=data_path,
            name=_name,
            reader_cfg=subjective_reader_cfg,
            infer_cfg=subjective_infer_cfg,
            eval_cfg=subjective_eval_cfg,
            mode='singlescore',
            summarizer=dict(type=MTBench101Summarizer, judge_type='single'),
        ))
```

opencompass/configs/models/qwen2_5/hf_qwen2_5_7b_instruct.py

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='qwen2.5-7b-instruct-hf',
        path='/data/Qwen2.5-7B-Instruct/',
        max_out_len=40,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```

Reproduces the problem - command or script

```shell
python run.py --datasets mtbench101_judge --models hf_qwen2_5_7b_instruct --debug
```

Reproduces the problem - error message

For the MTBench101 dataset, the load method constructs each turn as an individual data sample (datasets/subjective/mtbench101.py). Consequently, no real multi-turn context remains during inference, which makes infer_mode='last' and infer_mode='every' behave identically (openicl/icl_inferencer/icl_chat_inferencer.py). Why was it designed this way?
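To illustrate the point, here is a minimal sketch (hypothetical helper names, heavily simplified relative to the actual mtbench101.py): when a dialogue is flattened so that each turn becomes its own sample carrying its history, every turn of the dialogue is already the final turn of exactly one sample, so generating only at each sample's last turn covers the same positions as generating at every turn.

```python
def flatten_per_turn(dialogue):
    # One sample per turn: sample i holds turns 0..i (history included).
    # This mimics, in simplified form, the per-turn sample construction
    # described above; the helper name is illustrative, not the real API.
    return [{'dialogue': dialogue[:i + 1]} for i in range(len(dialogue))]

def last_turns(samples):
    # infer_mode='last': generate only for each sample's final turn.
    return [s['dialogue'][-1] for s in samples]

dialogue = ['turn1', 'turn2', 'turn3']
samples = flatten_per_turn(dialogue)

# Every turn of the original dialogue is the final turn of exactly one
# sample, so 'last' already reaches every turn once.
assert last_turns(samples) == dialogue
```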

Additional information

No response

wenba0 avatar Oct 15 '25 07:10 wenba0