opencompass icon indicating copy to clipboard operation
opencompass copied to clipboard

[Bug] MBPP evaluator cannot extract the correct anwser

Open guoshengCS opened this issue 1 year ago • 4 comments

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

torch2.2.0+vllm-0.4.0

Reproduces the problem - code/configuration sample

evaluate mbpp + qwen2-72b-vllm with following config

from mmengine.config import read_base

with read_base():
    from ...mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='qwen2-72b-vllm',
        path='Qwen/Qwen2-72B',
        model_kwargs=dict(tensor_parallel_size=4),
        max_out_len=1024,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=4),
    )
]

Reproduces the problem - command or script

evaluate mbpp + qwen2-72b-vllm with following config

Reproduces the problem - error message

Unexpected mbpp score compared with https://qwenlm.github.io/blog/qwen2/

dataset                                 version    metric    mode    qwen2-72b-vllm
--------------------------------------  ---------  --------  ------  ----------------
Overall                                 -          -         -       -
Exam                                    -          -         -       -
Language                                -          -         -       -
Knowledge                               -          -         -       -
Understanding                           -          -         -       -
Reasoning                               -          -         -       -
--------- 考试 Exam ---------           -          -         -       -
ceval                                   -          -         -       -
agieval                                 -          -         -       -
mmlu                                    -          -         -       -
cmmlu                                   -          -         -       -
GaokaoBench                             -          -         -       -
ARC-c                                   -          -         -       -
ARC-e                                   -          -         -       -
--------- 语言 Language ---------       -          -         -       -
WiC                                     -          -         -       -
chid-dev                                -          -         -       -
afqmc-dev                               -          -         -       -
WSC                                     -          -         -       -
tydiqa-goldp                            -          -         -       -
flores_100                              -          -         -       -
--------- 知识 Knowledge ---------      -          -         -       -
BoolQ                                   -          -         -       -
commonsense_qa                          -          -         -       -
triviaqa                                -          -         -       -
nq                                      -          -         -       -
--------- 理解 Understanding ---------  -          -         -       -
C3                                      -          -         -       -
race-middle                             -          -         -       -
race-high                               -          -         -       -
openbookqa_fact                         -          -         -       -
csl_dev                                 -          -         -       -
lcsts                                   -          -         -       -
Xsum                                    -          -         -       -
eprstmt-dev                             -          -         -       -
lambada                                 -          -         -       -
--------- 推理 Reasoning ---------      -          -         -       -
cmnli                                   -          -         -       -
ocnli                                   -          -         -       -
AX_b                                    -          -         -       -
AX_g                                    -          -         -       -
RTE                                     -          -         -       -
COPA                                    -          -         -       -
ReCoRD                                  -          -         -       -
hellaswag                               -          -         -       -
piqa                                    -          -         -       -
siqa                                    -          -         -       -
math                                    -          -         -       -
mathbench-arithmetic-cloze_en           -          -         -       -
mathbench-primary-cloze_cn              -          -         -       -
gsm8k                                   -          -         -       -
drop                                    -          -         -       -
openai_humaneval                        -          -         -       -
mbpp                                    1e1056     score     gen     11.40
bbh                                     -          -         -       -

Other information

A prediction case use qwen2-72b of mbpp is as following:

    "0": {
        "origin_prompt": "You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n\n[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n  res = tuple(set(test_tup1) & set(test_tup2))\r\n  return (res)' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n\n[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n    result = False\r\n    for i in range(2,int(math.sqrt(n)) + 1):\r\n        if n % i == 0:\r\n            result = True\r\n    return result' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n\n[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n  largest_nums = hq.nlargest(n, nums)\r\n  return largest_nums' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to remove first and last occurrence of a given character from the string. Your code should pass these tests:\n\n assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\"  \n\n[BEGIN]\n",
        "prediction": " 'def remove_Occ(str, char):\r\n  first = str.find(char)\r\n  last = str.rfind(char)\r\n  if first == -1:\r\n    return str\r\n  elif first == last:\r\n    return str[:first] + str[first+1:]\r\n  else:\r\n    return str[:first] + str[first+1:last] + str[last+1:]' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n  largest_num = max(nums)\r\n  return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n  largest_num = max(nums)\r\n  return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your",
        "gold": "assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\""
    }

As we can see, the prediction does not start with [BEGIN] which is the ending string of input prompt by https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23

mbpp_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\nassert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \nassert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n  res = tuple(set(test_tup1) & set(test_tup2))\r\n  return (res)' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \nassert is_not_prime(10) == True \nassert is_not_prime(35) == True \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n    result = False\r\n    for i in range(2,int(math.sqrt(n)) + 1):\r\n        if n % i == 0:\r\n            result = True\r\n    return result' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n  largest_nums = hq.nlargest(n, nums)\r\n  return largest_nums' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list}  \n'),
                dict(role='BOT', prompt='[BEGIN]\n'),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512),
)

However, MBPPEvaluator extract answers with patterns starting with [BEGIN] which get the non-first program among multiple program cases given by the base LLM model

    def _process_answer(self, text):
        patterns = [
            r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]",
            r"BEGIN\s*'(.*)'\s*\[DONE\]",
            r"\[BEGIN\]\s*'(.*)'\s*DONE",
            r"BEGIN\s*'(.*)'\s*DONE",
            r"\[BEGIN\]\s*'(.*)\s*\[DONE\]",
            r"BEGIN\s*'(.*)\s*\[DONE\]",
            r"\[BEGIN\]\s*'(.*)\s*DONE",
            r"BEGIN\s*'(.*)\s*DONE",
            r'\[BEGIN\]\s*(.*)\s*\[DONE\]',
            r'BEGIN\s*(.*)\s*\[DONE\]',
            r'\[BEGIN\]\s*(.*)\s*DONE',
            r'BEGIN\s*(.*)\s*DONE',
            r'```python\s*(.*)\s*```',
            r'```\s*(.*)\s*```',
            r'```python\s*(.*)\s*$',
            r'```\s*(.*)\s*$',
            r'(.*)\s*```.*',
            r"\[BEGIN\]\s*'(.*)",
            r'\[BEGIN\](.*)',
            r"'(.*)'\s*\[DONE\]",
        ]
        for p in patterns:
            match = re.search(p, text, re.DOTALL)
            if match:
                text = match.group(1)
                break
        text = text.split('```')[0]
        text = re.split(r"'?\s*\[?DONE\]?", text)[0]
        text = text.replace('\\_', '_')
        text = text.strip()
        return text

guoshengCS avatar Aug 08 '24 06:08 guoshengCS

your used prompt template has been deprecated, please try configs/datasets/mbpp/mbpp_gen_830460.py

tonysy avatar Aug 08 '24 06:08 tonysy

configs/datasets/mbpp/mbpp_gen_830460.py

Thanks for the quick reply! @tonysy

It seems has the same problem since the input prompt ends with [BEGIN] https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23 , thus the response would not start with it, while MBPPEvaluator only extract [BEGIN] started anwser.

guoshengCS avatar Aug 08 '24 06:08 guoshengCS

Got it, I think the prompt is designed for base model and we may need to upgrade the prompt compatible with instruct model.

tonysy avatar Aug 08 '24 06:08 tonysy

Got it, I think the prompt is designed for base model and we may need to upgrade the prompt compatible with instruct model.

hello, has this bug been fixed?

FlyCarrot avatar Aug 14 '24 14:08 FlyCarrot

One quick solution is to rewrite the _process_answer(self, text) as follows, it should extract the first substring that ended with [DONE]. def _process_answer(self, text): text = text.strip() match = re.search(r"('\s*|)([DONE]|DONE)", text) if match: text = text[:match.start()] match = re.search(r"([BEGIN]|BEGIN)('\s*|)", text) if match: text = text[match.end():] text = text.strip() if text.startswith("'"): text = text[1:] if text.endswith("'"): text = text[:-1] return text

huozhi621 avatar Dec 12 '24 07:12 huozhi621

One quick solution is to rewrite the _process_answer(self, text) as follows, it should extract the first substring that ended with [DONE]. def _process_answer(self, text): text = text.strip() match = re.search(r"('\s*|)([DONE]|DONE)", text) if match: text = text[:match.start()] match = re.search(r"([BEGIN]|BEGIN)('\s*|)", text) if match: text = text[match.end():] text = text.strip() if text.startswith("'"): text = text[1:] if text.endswith("'"): text = text[:-1] return text

Can someone update the _process_answer(self, text) soon?

huozhi621 avatar Dec 12 '24 07:12 huozhi621