opencompass
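[Bug] mmlu_pro answer extraction by regex is wrong
Prerequisites
Issue type
I am evaluating with officially supported tasks/models/datasets.
Environment
None
Reproduces the problem - code/configuration sample
None
Reproduces the problem - command or script
None
Reproduces the problem - error message
When mmlu_pro outputs are post-processed, every response that starts with "Answer: Let's think step by step." has the option L extracted, so the results and the scores are computed incorrectly.
Other information
Suggested change to first_option_postprocess: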
def first_option_postprocess(text: str, options: str, cushion=True) -> str:
    # Strip the chain-of-thought prefix before any option is extracted
    text = text.replace("Answer: Let's think step by step.", '')
    ....
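After this change the extracted answers look correct, but the change may not be the right approach, so I have not submitted a PR. Could you check whether there is a more suitable fix?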
Thanks for your report, we will check this bug and reply soon.
Can you provide a running command for this bug?
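My setup has some local changes: I was evaluating mmlu_pro with Qwen2.5-VL-Instruct when this problem showed up in the results. I also tried Qwen2.5-0.5B-Instruct and it has the same problem. To reproduce, you can use this command: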
python run.py eval.py
The contents of eval.py are as follows:
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_gen_cdbebf import mmlu_pro_datasets

datasets = [*mmlu_pro_datasets]

from opencompass.models import HuggingFacewithChatTemplate

model_path = '/path/to/Qwen2.5-0.5B-Instruct'
models = [
    dict(
        type=HuggingFacewithChatTemplate,
        path=model_path,
        tokenizer_path=model_path,
        max_seq_len=7168,
        max_out_len=1024,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]
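This problem still exists. The root cause is that the regular expression used in the mmlu_pro evaluation is flawed: it takes the letter X from the first match of the form "answer: x" as the answer.
The eval_cfg for mmlu_pro: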
mmlu_pro_eval_cfg = dict(
    evaluator=dict(type=AccEvaluator),
    pred_postprocessor=dict(
        type=match_answer_pattern,
        answer_pattern=r'(?i)ANSWER\s*:\s*([A-P])')
)

def match_answer_pattern(response_text: str, answer_pattern: str):
    match = re.search(answer_pattern, response_text)
    extracted_answer = match.group(1) if match else ''
    return extracted_answer
So an example that reproduces the bug is:
import re

answer_pattern = r'(?i)ANSWER\s*:\s*([A-P])'

def match_answer_pattern(response_text: str, answer_pattern: str):
    # re.search returns the first (leftmost) match only
    match = re.search(answer_pattern, response_text)
    extracted_answer = match.group(1) if match else ""
    return extracted_answer

RAW_STR = "The Final answer: ANSWER: F"
out = match_answer_pattern(RAW_STR, answer_pattern)
print(out)
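The output is A: the pattern matches "answer: A" inside "The Final answer: A", where that A is the first letter of the ANSWER that follows.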
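The same first-match behaviour would also explain the L reported at the top of this issue. A minimal sketch, assuming the model's output begins with the "Answer: Let's think step by step." prefix; the response string below is made up for illustration and is not taken from a real run:

```python
import re

answer_pattern = r'(?i)ANSWER\s*:\s*([A-P])'

# Hypothetical response: the chain-of-thought prefix comes first,
# and the real answer only appears later in the text.
response = "Answer: Let's think step by step. The options are ... ANSWER: C"

# re.search stops at the leftmost match, which is "Answer: L"
# (the L of "Let's" falls inside the class [A-P]).
first = re.search(answer_pattern, response)
extracted = first.group(1) if first else ''
print(extracted)  # L
```

Because `(?i)` makes the match case-insensitive and L lies within A-P, any response that opens with this prefix yields L regardless of what the model finally answers.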
One possible fix is to change answer_pattern to r'(?s).*(?i)ANSWER\s*:\s*([A-P])'. It passes a simple test, but more cases need further verification.
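One caveat when trying this pattern: since Python 3.11, a global inline flag such as `(?i)` that is not at the very start of the pattern raises `re.error`, and earlier versions back to 3.6 emit a DeprecationWarning. The sketch below merges both flags at the front, which should be equivalent. The name `last_answer_pattern` is hypothetical, and this is an illustration of the idea, not a change that exists in opencompass:

```python
import re

# Flags merged at the start: s lets "." cross newlines, i ignores case.
# The greedy ".*" consumes as much as possible, so the capture comes
# from the LAST "answer: X" in the response rather than the first.
last_answer_pattern = r'(?si).*ANSWER\s*:\s*([A-P])'

def match_answer_pattern(response_text: str, answer_pattern: str) -> str:
    match = re.search(answer_pattern, response_text)
    return match.group(1) if match else ''

raw = "The Final answer: ANSWER: F"
print(match_answer_pattern(raw, last_answer_pattern))  # F

# Hypothetical multi-line response with the chain-of-thought prefix
cot = "Answer: Let's think step by step.\nSome reasoning.\nAnswer: C"
print(match_answer_pattern(cot, last_answer_pattern))  # C

# If no later "answer: X" exists, the prefix is still the only match
no_final = "Answer: Let's think step by step. So it is (C)."
print(match_answer_pattern(no_final, last_answer_pattern))  # L
```

As the last case shows, taking the last match does not help when the response never restates its answer in the "answer: X" form, so stripping the prefix or tightening the pattern may still be needed.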