opencompass
[Bug] MTBench101 dataset multi-turn
Prerequisites
Problem Type
I am running an evaluation with an officially supported task / model / dataset.
Environment
```shell
conda create --name opencompass python=3.10
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
Reproduces the problem - code/configuration sample
opencompass/configs/datasets/subjective/multiround/mtbench101_judge.py
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import ChatInferencer, GenInferencer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.datasets import MTBench101Dataset
from opencompass.summarizers import MTBench101Summarizer
from opencompass.models import HuggingFacewithChatTemplate
from opencompass.openicl.icl_evaluator import AccEvaluator

subjective_reader_cfg = dict(
    input_columns=['dialogue', 'task', 'multi_id', 'turn_id', 'system_prompt', 'prompt_template'],
    output_column='judge',
)

subjective_all_sets = [
    'mtbench101',
]
data_path = '/home/ljj/data/subjective'

cfg = dict(
    type=HuggingFacewithChatTemplate,
    abbr='qwen2.5-7b-instruct-hf',      # model abbreviation
    path='/data/Qwen2.5-7B-Instruct/',  # HuggingFace path of the model
    max_out_len=102,                    # maximum number of generated tokens
    batch_size=8,                       # batch size
    run_cfg=dict(num_gpus=1),           # number of GPUs this model requires
)

mtbench101_datasets = []

for _name in subjective_all_sets:
    subjective_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            template="""{dialogue}""",
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=ChatInferencer, infer_mode='last'),
    )

    subjective_eval_cfg = dict(
        evaluator=dict(type=AccEvaluator),
        pred_role='BOT',
    )

    mtbench101_datasets.append(
        dict(
            abbr=f'{_name}',
            type=MTBench101Dataset,
            path=data_path,
            name=_name,
            reader_cfg=subjective_reader_cfg,
            infer_cfg=subjective_infer_cfg,
            eval_cfg=subjective_eval_cfg,
            mode='singlescore',
            summarizer=dict(type=MTBench101Summarizer, judge_type='single'),
        ))
```
opencompass/configs/models/qwen2_5/hf_qwen2_5_7b_instruct.py
```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='qwen2.5-7b-instruct-hf',
        path='/data/Qwen2.5-7B-Instruct/',
        max_out_len=40,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```
Reproduces the problem - command or script
```shell
python run.py --datasets mtbench101_judge --models hf_qwen2_5_7b_instruct --debug
```
Reproduces the problem - error message
For the MTBench101 dataset, the `load` method constructs each turn as an individual data sample (`datasets/subjective/mtbench101.py`).
Consequently, no real multi-turn context is left at inference time, so `infer_mode='last'` and `infer_mode='every'` behave identically (`openicl/icl_inferencer/icl_chat_inferencer.py`).
Why was it designed this way?
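The collapse can be illustrated with a small standalone sketch (plain Python, not the actual OpenCompass code; `per_turn_samples` and `prediction_points` are hypothetical helpers invented for this example): once the loader emits one sample per user turn, every dialogue prefix is already its own sample, so inferring only the last user turn of each sample covers exactly the same set of contexts as inferring every user turn.

```python
def per_turn_samples(dialogue):
    """Mimic per-turn loading: one sample per user turn, each holding
    the dialogue history up to and including that user turn."""
    return [dialogue[:i + 1] for i, m in enumerate(dialogue) if m['role'] == 'user']

def prediction_points(samples, mode):
    """Return the distinct contexts (as tuples of messages) at which a
    reply would be generated, for mode 'last' or 'every'."""
    points = set()
    for sample in samples:
        user_idx = [i for i, m in enumerate(sample) if m['role'] == 'user']
        chosen = user_idx[-1:] if mode == 'last' else user_idx
        for i in chosen:
            points.add(tuple(m['content'] for m in sample[:i + 1]))
    return points

dialogue = [
    {'role': 'user', 'content': 'Q1'},
    {'role': 'assistant', 'content': 'A1'},
    {'role': 'user', 'content': 'Q2'},
]

samples = per_turn_samples(dialogue)  # two samples: [Q1] and [Q1, A1, Q2]

# Because every prefix is already a separate sample, 'every' only
# re-generates contexts that 'last' already covers via shorter samples.
assert prediction_points(samples, 'last') == prediction_points(samples, 'every')
```

Under this assumption, `infer_mode='every'` adds no new prediction contexts over `infer_mode='last'`; it would only duplicate work, which matches the observed behavior.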
Other information
No response