opencompass
[Bug] MBPP evaluator cannot extract the correct answer
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
torch2.2.0+vllm-0.4.0
Reproduces the problem - code/configuration sample
Evaluate mbpp + qwen2-72b-vllm with the following config:
from mmengine.config import read_base

with read_base():
    from ...mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='qwen2-72b-vllm',
        path='Qwen/Qwen2-72B',
        model_kwargs=dict(tensor_parallel_size=4),
        max_out_len=1024,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=4),
    )
]
Reproduces the problem - command or script
Evaluate mbpp + qwen2-72b-vllm with the config above.
Reproduces the problem - error message
Unexpected mbpp score compared with https://qwenlm.github.io/blog/qwen2/
dataset version metric mode qwen2-72b-vllm
-------------------------------------- --------- -------- ------ ----------------
Overall - - - -
Exam - - - -
Language - - - -
Knowledge - - - -
Understanding - - - -
Reasoning - - - -
--------- 考试 Exam --------- - - - -
ceval - - - -
agieval - - - -
mmlu - - - -
cmmlu - - - -
GaokaoBench - - - -
ARC-c - - - -
ARC-e - - - -
--------- 语言 Language --------- - - - -
WiC - - - -
chid-dev - - - -
afqmc-dev - - - -
WSC - - - -
tydiqa-goldp - - - -
flores_100 - - - -
--------- 知识 Knowledge --------- - - - -
BoolQ - - - -
commonsense_qa - - - -
triviaqa - - - -
nq - - - -
--------- 理解 Understanding --------- - - - -
C3 - - - -
race-middle - - - -
race-high - - - -
openbookqa_fact - - - -
csl_dev - - - -
lcsts - - - -
Xsum - - - -
eprstmt-dev - - - -
lambada - - - -
--------- 推理 Reasoning --------- - - - -
cmnli - - - -
ocnli - - - -
AX_b - - - -
AX_g - - - -
RTE - - - -
COPA - - - -
ReCoRD - - - -
hellaswag - - - -
piqa - - - -
siqa - - - -
math - - - -
mathbench-arithmetic-cloze_en - - - -
mathbench-primary-cloze_cn - - - -
gsm8k - - - -
drop - - - -
openai_humaneval - - - -
mbpp 1e1056 score gen 11.40
bbh - - - -
Other information
One MBPP prediction from qwen2-72b looks like this:
"0": {
"origin_prompt": "You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n\n[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n\n[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n\n[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to remove first and last occurrence of a given character from the string. Your code should pass these tests:\n\n assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\" \n\n[BEGIN]\n",
"prediction": " 'def remove_Occ(str, char):\r\n first = str.find(char)\r\n last = str.rfind(char)\r\n if first == -1:\r\n return str\r\n elif first == last:\r\n return str[:first] + str[first+1:]\r\n else:\r\n return str[:first] + str[first+1:last] + str[last+1:]' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n largest_num = max(nums)\r\n return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n largest_num = max(nums)\r\n return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your",
"gold": "assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\""
}
As we can see, the prediction does not start with [BEGIN], because [BEGIN] is the last string of the input prompt, as defined in https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23
mbpp_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\nassert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \nassert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n'),
dict(role='BOT', prompt="[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n "),
dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \nassert is_not_prime(10) == True \nassert is_not_prime(35) == True \n'),
dict(role='BOT', prompt="[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n "),
dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n'),
dict(role='BOT', prompt="[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n "),
dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list} \n'),
dict(role='BOT', prompt='[BEGIN]\n'),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)
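However, MBPPEvaluator extracts answers with patterns that start with [BEGIN]. When the base model goes on to generate several more programs, these patterns skip the first program (the actual answer) and pick up a later one. In the case above, the extracted code is largest_number rather than remove_Occ: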
def _process_answer(self, text):
    patterns = [
        r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]",
        r"BEGIN\s*'(.*)'\s*\[DONE\]",
        r"\[BEGIN\]\s*'(.*)'\s*DONE",
        r"BEGIN\s*'(.*)'\s*DONE",
        r"\[BEGIN\]\s*'(.*)\s*\[DONE\]",
        r"BEGIN\s*'(.*)\s*\[DONE\]",
        r"\[BEGIN\]\s*'(.*)\s*DONE",
        r"BEGIN\s*'(.*)\s*DONE",
        r'\[BEGIN\]\s*(.*)\s*\[DONE\]',
        r'BEGIN\s*(.*)\s*\[DONE\]',
        r'\[BEGIN\]\s*(.*)\s*DONE',
        r'BEGIN\s*(.*)\s*DONE',
        r'```python\s*(.*)\s*```',
        r'```\s*(.*)\s*```',
        r'```python\s*(.*)\s*$',
        r'```\s*(.*)\s*$',
        r'(.*)\s*```.*',
        r"\[BEGIN\]\s*'(.*)",
        r'\[BEGIN\](.*)',
        r"'(.*)'\s*\[DONE\]",
    ]
    for p in patterns:
        match = re.search(p, text, re.DOTALL)
        if match:
            text = match.group(1)
            break
    text = text.split('```')[0]
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    text = text.replace('\\_', '_')
    text = text.strip()
    return text
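The behaviour described above can be reproduced outside OpenCompass with a minimal sketch. The prediction string here is a shortened, made-up stand-in for the real output, and only the first pattern and the DONE split from _process_answer are applied:

```python
import re

# Shortened stand-in for a base-model completion: the answer to the actual
# task comes first, followed by a further task the model invented itself.
prediction = (
    " 'def remove_Occ(s, ch):\r\n    return s' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your task: ...\n\n"
    "[BEGIN]\n 'def largest_number(nums):\r\n    return max(nums)' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your"
)

# First pattern from the evaluator, applied the same way (re.search + DOTALL).
first_pattern = r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]"
match = re.search(first_pattern, prediction, re.DOTALL)
extracted = match.group(1)

# Same DONE-based post-processing step as in _process_answer.
extracted = re.split(r"'?\s*\[?DONE\]?", extracted)[0].strip()

print(extracted)
```

Because the answer itself is not preceded by [BEGIN], the search anchors on the [BEGIN] of the invented follow-up task, so the code handed to the test cases defines largest_number instead of remove_Occ.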
The prompt template you used has been deprecated. Please try configs/datasets/mbpp/mbpp_gen_830460.py
Thanks for the quick reply! @tonysy
It seems to have the same problem: the input prompt ends with [BEGIN] (https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23), so the response does not start with it, while MBPPEvaluator only extracts answers that start with [BEGIN].
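One way to check this reasoning is to re-attach the prompt's trailing [BEGIN] to the completion before extraction. The sketch below is only an illustration of that idea, not something OpenCompass does: extract_with_suffix, PROMPT_SUFFIX and the sample prediction are hypothetical names and data, and only the first evaluator pattern is used.

```python
import re

PROMPT_SUFFIX = "[BEGIN]\n"  # content of the last BOT turn in the prompt template


def extract_with_suffix(prediction: str) -> str:
    """Hypothetical helper: re-attach the prompt suffix, then extract."""
    text = PROMPT_SUFFIX + prediction
    match = re.search(r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]", text, re.DOTALL)
    if match:
        text = match.group(1)
    # The greedy group still spans every later program, so cut at the first DONE.
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    return text.strip()


# Made-up, shortened completion in the same shape as the real one.
prediction = (
    " 'def remove_Occ(s, ch):\r\n    return s' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your task: ...\n\n"
    "[BEGIN]\n 'def largest_number(nums):\r\n    return max(nums)' \n[DONE] \n\n \n"
    "You are an expert Python programmer, and here is your"
)
result = extract_with_suffix(prediction)
```

With the suffix restored, the match starts at the answer itself and the existing DONE split keeps only the first program.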
Got it. I think the prompt was designed for base models, and we may need to update it to be compatible with instruct models.
Hello, has this bug been fixed?
One quick solution is to rewrite _process_answer(self, text) as follows, so that it extracts the first substring that ends with [DONE]:
def _process_answer(self, text):
    text = text.strip()
    # Cut at the first [DONE] (or bare DONE), together with an optional
    # closing quote before it. The brackets are escaped so that they match
    # literally instead of forming a character class.
    match = re.search(r"('\s*|)(\[DONE\]|DONE)", text)
    if match:
        text = text[:match.start()]
    # Drop a leading [BEGIN] (or bare BEGIN) if the model emitted one.
    match = re.search(r"(\[BEGIN\]|BEGIN)('\s*|)", text)
    if match:
        text = text[match.end():]
    text = text.strip()
    # Remove the single quotes that wrap the code in the few-shot examples.
    if text.startswith("'"):
        text = text[1:]
    if text.endswith("'"):
        text = text[:-1]
    return text
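The suggestion can be tried on its own with a standalone sketch. It assumes the brackets in the regexes are meant to be escaped so that [DONE] and [BEGIN] match literally; process_answer and the sample string are hypothetical and not part of OpenCompass.

```python
import re

# Brackets are escaped so that "[DONE]" and "[BEGIN]" match literally
# instead of acting as regex character classes.
DONE_RE = re.compile(r"('\s*|)(\[DONE\]|DONE)")
BEGIN_RE = re.compile(r"(\[BEGIN\]|BEGIN)('\s*|)")


def process_answer(text: str) -> str:
    """Hypothetical standalone version: keep only the text before the first DONE."""
    text = text.strip()
    match = DONE_RE.search(text)
    if match:
        text = text[:match.start()]
    match = BEGIN_RE.search(text)
    if match:
        text = text[match.end():]
    text = text.strip()
    if text.startswith("'"):
        text = text[1:]
    if text.endswith("'"):
        text = text[:-1]
    return text


# Made-up, shortened completion: the real answer, then an invented extra task.
sample = (
    " 'def remove_Occ(s, ch):\r\n    return s' \n[DONE] \n\n \n"
    "You are an expert Python programmer ...\n\n"
    "[BEGIN]\n 'def largest_number(nums):\r\n    return max(nums)' \n[DONE]"
)
first_program = process_answer(sample)
```

On this sample the function keeps only the remove_Occ program. One caveat of the approach is that a bare DONE inside the generated code would also be treated as the end marker.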
Can someone update the _process_answer(self, text) soon?