opencompass icon indicating copy to clipboard operation
opencompass copied to clipboard

[Bug] ceval, cmmlu, mmlu 的 gen 对话模板行为不一致,mmlu 的对话模板存在问题

Open LiuLinyun opened this issue 1 year ago • 11 comments

先决条件

  • [X] 我已经搜索过 问题讨论 但未得到预期的帮助。
  • [X] 错误在 最新版本 中尚未被修复。

问题类型

我正在使用官方支持的任务/模型/数据集进行评估。

环境

{'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda', 'GCC': 'gcc (Debian 10.2.1-6) 10.2.1 20210110', 'GPU 0,1': 'NVIDIA A100 80GB PCIe', 'MMEngine': '0.9.1', 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89', 'OpenCV': '4.8.1', 'PyTorch': '2.1.0+cu118', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2022.2-Product Build 20220804 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.1.1 (Git Hash ' '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX512\n' ' - CUDA Runtime 11.8\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90\n' ' - CuDNN 8.7\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, CUDA_VERSION=11.8, ' 'CUDNN_VERSION=8.7.0, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM ' '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK ' '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE ' '-O2 -fPIC -Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wno-psabi ' '-Wno-error=pedantic -Wno-error=old-style-cast ' '-Wno-invalid-partial-specialization ' '-Wno-unused-private-field ' '-Wno-aligned-allocation-unavailable ' '-Wno-missing-braces -fdiagnostics-color=always ' '-faligned-new -Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Werror=cast-function-type ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, ' 'TORCH_DISABLE_GPU_ASSERTS=ON, ' 'TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, \n', 'Python': '3.10.13 (main, Nov 1 2023, 14:20:38) [GCC 10.2.1 20210110]', 'TorchVision': '0.16.0+cu121', 'numpy_random_seed': 2147483648, 'opencompass': '0.1.7+6710cbf', 'sys.platform': 'linux'}

重现问题 - 代码/配置示例

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM


with read_base():
    from .datasets.ceval.ceval_gen import ceval_datasets
    from .datasets.cmmlu.cmmlu_gen import cmmlu_datasets
    from .datasets.mmlu.mmlu_gen import mmlu_datasets

model_path = 'my-model-path'
model_abbr = 'my-model-abbr'

_meta_template = dict(
    round=[
        dict(role='HUMAN', begin='USER:'),
        dict(role='BOT', begin='ASSISTANT:', generate=True),
    ],
)

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
            use_fast=False,
        ),
        meta_template=_meta_template,
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]


datasets = [*ceval_datasets, *cmmlu_datasets, *mmlu_datasets]

重现问题 - 命令或脚本

python run.py config/eval_mymodel.py

重现问题 - 错误信息

MMLU predict json 结果示例

    "0": {
        "origin_prompt": "USER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state legislature has recently enacted a statute making it a misdemeanor to curse or revile or use obscene or opprobrious language toward or in reference to a police officer perfonning his duties. A student at a state university organized a demonstration on campus to protest the war. The rally was attended by a group of 50 students who shouted anti-war messages at cars passing by. To show his contempt for the United States, the student sewed the American flag to the rear of his jeans. When a police officer saw the flag sown on the student's jeans, he approached and told him to remove the flag or he would be placed under arrest. The student became angered and shouted at the police officer, \"Listen, you bastard, I'll wear this rag anywhere I please. \" The student was subsequently placed under arrest and charged with violating the state statute. The student subsequently brings suit in state court challenging the constitutionality of the statute. The strongest constitutional argument for the student is that\nA. the statute is void for vagueness under the Fourteenth Amendment's due process clause.\nB. the statute is invalid because it violates the petitioner's freedom of speech under the First Amendment.\nC. the statute is an abridgment of freedom of speech under the First Amendment because less restrictive means are available for achieving the same purpose.\nD. the statute is overbroad and consequently invalid under the First and FourteenthAmendments.\nAnswer: ASSISTANT: D\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state has recently enacted a statute prohibiting the disposal of any nuclear wastes within the state. This law does not contravene or conflict with any federal statutes. A man operates a company in the state that is engaged in the disposal of nuclear wastes. Subsequent to the passage of the state statute, the man, not yet aware of the new law, entered into contracts with many out-of-state firms to dispose of their nuclear wastes in the state. On account of this new law, however, the man will be unable to perform these contracts. Assume that the man has standing to challenge this state law. Which of the following presents his strongest constitutional grounds to challenge the state law prohibiting the disposal of nuclear wastes within the state?\nA. The commerce clause.\nB. The equal protection clause of the Fourteenth Amendment.\nC. The privileges and immunities clause of Article IV, Section 2. \nD. The contract clause.\nAnswer: ASSISTANT: A\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: Judge took judicial notice of some facts at the beginning of the trial. Which of the following is not an appropriate kind of fact for judicial notice?\nA. Indisputable facts.\nB. Facts that have been asserted by individual political organizations.\nC. Facts recognized to be true by common knowledge.\nD. Facts capable of scientific verification.\nAnswer: ASSISTANT: B\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: On October 1, 1980, a developer, owner of several hundred acres in a rural county, drafted a general development plan for the area. The duly recorded plan imposed elaborate limitations and restrictions upon the land in the plan, which was to be developed as a residential district. The restrictions were to extend to all persons acquiring any of the lots and to their heirs, assigns, and lessees. It was further provided that all subsequent owners would be charged with due notice of the restrictions. Among those restrictions in the general plan were the following:(22) A franchise right is created in a strip of land 10 feet in width along the rear of each lot for the use of public utility companies with right of ingress and egress. (23) No house or structure of any kind shall be built on the aforementioned strip of land running through the said blocks. In 2000, a retiree purchased one of the lots, built a house, and erected a fence in the rear of his property within the restricted area. In 2004, a teacher purchased a lot adjacent to the retiree's property and built a new house. Two years later, a librarian purchased the lot that adjoined the teacher's property. The three deeds to those properties each contained references to the deed book where the general plan was recorded. In 2008, the librarian began the construction of a seven-foot post-and-rail fence along the line dividing his lot with the teacher's, and along the center of the area subject to the franchise right. Although the teacher objected to its construction, the fence was completed. If the teacher seeks a mandatory injunction to compel removal of the librarian's fence, the court will most likely\nA. grant relief, because the fence was in violation of the easement restriction. \nB. grant relief, because the encroachment of the fence violated the restriction in the original plan. \nC. deny relief, because the teacher failed to enforce the restriction against the retiree. \nD. deny relief, because the fence would not be construed as \"a structure\" within the terms of the restriction. \nAnswer: ASSISTANT: B\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A son owed a creditor $5,000. The son's father contacted the creditor and told him that he wanted to pay the son's debt. The father signed a document that stated the father would pay the son's debt at a rate of $500 a month for 10 months. The creditor made no written or oral commitment to forbear to sue the son to collect the $5,000 debt, and the father made no oral or written request for any such forbearance. For the next five months, the father made and the creditor accepted the $500 monthly payments as agreed. During that period, the creditor, in fact, did forbear to take any legal action against the son. However, the father then informed the creditor that he would make no further payments on the debt. Which of the following is the most persuasive argument that the father is liable to the creditor under the terms of their agreement?\nA. The father's promise and the creditor's reliance thereon, if proved, gave rise to a valid claim by the creditor against the father based on the doctrine of promissory estoppel. \nB. Because it was foreseeable that the father's promise would induce the creditor to forbear taking any action against the son, such forbearance was, as a matter of law, a bargained-for consideration for the father's promise. \nC. The father's five payments to the creditor totaling $2,500 manifested a serious intent on the father's part to be contractually bound, and such manifestation is generally recognized as an effective substitute for consideration. \nD. By assuming the antecedent debt obligation that the son owed to the creditor, the father became a surety whose promise to the creditor was enforceable, since it was in writing and supported by adequate consideration. \nAnswer: ASSISTANT: A\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQ: One afternoon, a pilot was flying a small airplane when it suddenly ran out of gas. As he was coming in for an emergency landing, the plane crossed into a neighboring state at a very low altitude. At this time, a 9-year-old boy was walking to school when he was struck and injured by an object, which may have fallen from the plane. In federal court, a negligence suit was brought against the pilot by the father of the boy for his son. Accompanied by his father, the boy had visited an attorney for preliminary discussions regarding the case. However, the father did not retain the attorney to represent his son in the lawsuit. Instead, the father hired another lawyer to handle the case. At trial, the pilot's attorney calls the consulting attorney to testify what the boy had said to him regarding his physical condition during the consultation that the attorney had had with the boy and his father. The attorney's testimony is\nA. admissible, because the attorney-client privilege was waived by the filing of the lawsuit.\nB. admissible, because there is no privilege of confidentiality when a person other than the client is present at the attorney-client consultation.\nC. inadmissible, because the attorney-client privilege prevents such a breach of confidential communications.\nD. inadmissible, because it was a statement of physical condition not made for the purpose of obtaining medical treatment.\nA: ASSISTANT: ",
        "prediction": "1\nB: ASSISTANT: 2\nC: ASSISTANT: 3\nD: ASSISTANT: 4\nAnswer: ASSISTANT: 4\nUSER: There is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A man was convicted of murder in a state court. He appealed to the state supreme court, which affirmed the conviction. The man then filed a petition for a writ of",
        "gold": "C"
    }

输入给模型的 prompt 用的是 Question: Answer: 最后真正要模型回答的问题用的是 Q: A: 而且这个 A: 还是在问题 prompt 中,后面紧跟的是 ASSISTANT 角色提示

其他信息

mmlu 的数据集配置代码有问题,没有考虑 ice_template 和 prompt_template 的冲突

LiuLinyun avatar Nov 13 '23 11:11 LiuLinyun

Thanks for the reporting. Please try the prompt template

mmlu_gen_23a9a9
mmlu_gen_79e572

We will investigate the influence of the mentioned problem.

tonysy avatar Nov 13 '23 11:11 tonysy

mmlu_gen_23a9a9 mmlu_gen_79e572 These two generate config also have some problems: The last input sample's "Answer:" should in BOT's template string if there will be some role token like "ASSISTANT" before string "Answer: "

LiuLinyun avatar Nov 13 '23 12:11 LiuLinyun

May you should redesign meta-template and ice-template usage.

LiuLinyun avatar Nov 13 '23 12:11 LiuLinyun

Sorry,I believe there may be some misunderstanding. In our design, "Answer" in user's prompt is expected.

tonysy avatar Nov 13 '23 16:11 tonysy

Here is an example with ChatGLM web

image

In this example, we can formulate this process as:

<USER> Question: 2+2=?\n Answer:4\nQuestion: 1+1=?\nAnswer:<Bot>2

tonysy avatar Nov 13 '23 16:11 tonysy

Some models will not output the answer with the expected format, because in the few-shot cases, there is not any <Bot> between Answer: and <the answer>, but there is a <Bot> between in the last question that model need to answer, which will make some model generate with doubts how to generage.

LiuLinyun avatar Nov 14 '23 08:11 LiuLinyun

I (and most developers) hope the final prompt would be like, making chatml template as an example,

<|im_start|>user
2+2=?<|im_end|>
<|im_start|>assistant

The str in python is <|im_start|>user\n2+2=?<|im_end|>\n<|im_start|>assistant\n

If we set max_output_len=1, a good model is expected to generate and only generate "4".

TissueC avatar Nov 22 '23 05:11 TissueC

I (and most developers) hope the final prompt would be like, making chatml template as an example,

<|im_start|>user
2+2=?<|im_end|>
<|im_start|>assistant

The str in python is <|im_start|>user\n2+2=?<|im_end|>\n<|im_start|>assistant\n

If we set max_output_len=1, a good model is expected to generate and only generate "4".

Thanks for the feedback, we are working on the template refactor, and will take the chatml into consideration.

tonysy avatar Nov 23 '23 07:11 tonysy

Any solution?

ntudy avatar Nov 30 '23 12:11 ntudy

This problem seems to have not been solved. Please consider fix it.

StructSeeker avatar May 10 '24 03:05 StructSeeker

It seems CLUE_C3_gen suffers from the same problem when running qwen.

StructSeeker avatar May 13 '24 19:05 StructSeeker