opencompass
[Bug] MBPP evaluator cannot extract the correct answer
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
torch2.2.0+vllm-0.4.0
Reproduces the problem - code/configuration sample
Evaluate mbpp + qwen2-72b-vllm with the following config:
from mmengine.config import read_base

with read_base():
    from ...mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='qwen2-72b-vllm',
        path='Qwen/Qwen2-72B',
        model_kwargs=dict(tensor_parallel_size=4),
        max_out_len=1024,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=4),
    )
]
Reproduces the problem - command or script
Run the mbpp + qwen2-72b-vllm evaluation with the config above.
Reproduces the problem - error message
The MBPP score (11.40) does not match the result reported at https://qwenlm.github.io/blog/qwen2/
dataset                                 version    metric    mode    qwen2-72b-vllm
--------------------------------------  ---------  --------  ------  ----------------
Overall                                 -          -         -       -
Exam                                    -          -         -       -
Language                                -          -         -       -
Knowledge                               -          -         -       -
Understanding                           -          -         -       -
Reasoning                               -          -         -       -
--------- 考试 Exam ---------             -          -         -       -
ceval                                   -          -         -       -
agieval                                 -          -         -       -
mmlu                                    -          -         -       -
cmmlu                                   -          -         -       -
GaokaoBench                             -          -         -       -
ARC-c                                   -          -         -       -
ARC-e                                   -          -         -       -
--------- 语言 Language ---------         -          -         -       -
WiC                                     -          -         -       -
chid-dev                                -          -         -       -
afqmc-dev                               -          -         -       -
WSC                                     -          -         -       -
tydiqa-goldp                            -          -         -       -
flores_100                              -          -         -       -
--------- 知识 Knowledge ---------        -          -         -       -
BoolQ                                   -          -         -       -
commonsense_qa                          -          -         -       -
triviaqa                                -          -         -       -
nq                                      -          -         -       -
--------- 理解 Understanding ---------    -          -         -       -
C3                                      -          -         -       -
race-middle                             -          -         -       -
race-high                               -          -         -       -
openbookqa_fact                         -          -         -       -
csl_dev                                 -          -         -       -
lcsts                                   -          -         -       -
Xsum                                    -          -         -       -
eprstmt-dev                             -          -         -       -
lambada                                 -          -         -       -
--------- 推理 Reasoning ---------        -          -         -       -
cmnli                                   -          -         -       -
ocnli                                   -          -         -       -
AX_b                                    -          -         -       -
AX_g                                    -          -         -       -
RTE                                     -          -         -       -
COPA                                    -          -         -       -
ReCoRD                                  -          -         -       -
hellaswag                               -          -         -       -
piqa                                    -          -         -       -
siqa                                    -          -         -       -
math                                    -          -         -       -
mathbench-arithmetic-cloze_en           -          -         -       -
mathbench-primary-cloze_cn              -          -         -       -
gsm8k                                   -          -         -       -
drop                                    -          -         -       -
openai_humaneval                        -          -         -       -
mbpp                                    1e1056     score     gen     11.40
bbh                                     -          -         -       -
Other information
Here is one MBPP prediction produced by qwen2-72b:
"0": {
"origin_prompt": "You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n\n[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n\n[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n\n[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to remove first and last occurrence of a given character from the string. Your code should pass these tests:\n\n assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\" \n\n[BEGIN]\n",
"prediction": " 'def remove_Occ(str, char):\r\n first = str.find(char)\r\n last = str.rfind(char)\r\n if first == -1:\r\n return str\r\n elif first == last:\r\n return str[:first] + str[first+1:]\r\n else:\r\n return str[:first] + str[first+1:last] + str[last+1:]' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n largest_num = max(nums)\r\n return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n largest_num = max(nums)\r\n return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your",
"gold": "assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\""
}
As we can see, the prediction does not start with [BEGIN], because [BEGIN] is already the last string of the input prompt, as defined in https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23
mbpp_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\nassert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \nassert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n "),
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \nassert is_not_prime(10) == True \nassert is_not_prime(35) == True \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n "),
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n "),
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list} \n'),
                dict(role='BOT', prompt='[BEGIN]\n'),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512),
)
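However, MBPPEvaluator extracts answers with patterns that start with [BEGIN]. The base model does not stop after its first answer: it keeps generating new tasks, each with its own [BEGIN] ... [DONE] block. The first match therefore begins at a [BEGIN] the model generated itself, so the evaluator picks up a later program and skips the actual answer, which comes before any [BEGIN] in the prediction.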
def _process_answer(self, text):
    patterns = [
        r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]",
        r"BEGIN\s*'(.*)'\s*\[DONE\]",
        r"\[BEGIN\]\s*'(.*)'\s*DONE",
        r"BEGIN\s*'(.*)'\s*DONE",
        r"\[BEGIN\]\s*'(.*)\s*\[DONE\]",
        r"BEGIN\s*'(.*)\s*\[DONE\]",
        r"\[BEGIN\]\s*'(.*)\s*DONE",
        r"BEGIN\s*'(.*)\s*DONE",
        r'\[BEGIN\]\s*(.*)\s*\[DONE\]',
        r'BEGIN\s*(.*)\s*\[DONE\]',
        r'\[BEGIN\]\s*(.*)\s*DONE',
        r'BEGIN\s*(.*)\s*DONE',
        r'```python\s*(.*)\s*```',
        r'```\s*(.*)\s*```',
        r'```python\s*(.*)\s*$',
        r'```\s*(.*)\s*$',
        r'(.*)\s*```.*',
        r"\[BEGIN\]\s*'(.*)",
        r'\[BEGIN\](.*)',
        r"'(.*)'\s*\[DONE\]",
    ]
    for p in patterns:
        match = re.search(p, text, re.DOTALL)
        if match:
            text = match.group(1)
            break
    text = text.split('```')[0]
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    text = text.replace('\\_', '_')
    text = text.strip()
    return text
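
The behaviour described above can be reproduced in isolation. The sketch below is not OpenCompass code: it copies only the first pattern and the post-processing steps from `_process_answer`, and runs them on a shortened, made-up stand-in for the prediction. `extract_with_prefix` is a hypothetical workaround (restore the prompt's trailing `[BEGIN]` and cut at the first `[DONE]`), offered as one possible direction rather than the project's fix.

```python
import re

# Shortened stand-in for the prediction quoted above (assumption: same shape,
# simplified bodies). The real answer comes first, then the base model invents
# a follow-up task with its own [BEGIN] ... [DONE] block.
prediction = (
    " 'def remove_Occ(s, ch):\r\n    return s' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your task: ...\n\n"
    "[BEGIN]\n 'def largest_number(nums):\r\n    return max(nums)' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your"
)

# First pattern tried by _process_answer.
FIRST_PATTERN = r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]"


def extract_like_evaluator(text):
    """Mimic the first pattern plus the post-processing of _process_answer."""
    match = re.search(FIRST_PATTERN, text, re.DOTALL)
    if match:
        text = match.group(1)
    text = text.split('```')[0]
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    return text.replace('\\_', '_').strip()


def extract_with_prefix(text, prefix='[BEGIN]\n'):
    """Hypothetical workaround: re-attach the prompt's trailing [BEGIN] and
    keep only the text up to the first [DONE] before matching."""
    text = prefix + text
    text = text.split('[DONE]')[0] + '[DONE]'
    return extract_like_evaluator(text)


if __name__ == '__main__':
    print(repr(extract_like_evaluator(prediction)))  # the later program
    print(repr(extract_with_prefix(prediction)))     # the actual answer
```

With this input, the evaluator-style extraction returns the `largest_number` program, while the prefixed variant returns `remove_Occ`. Configuring `[DONE]` as a stop sequence at generation time would be another option, if the model wrapper supports one.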