opencompass
Result discrepancy between old and new versions
Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.
Environment
```
{'CUDA available': True,
 'CUDA_HOME': '/usr',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
 'GPU 0,1,2,3,4,5': 'NVIDIA A800 80GB PCIe',
 'MMEngine': '0.10.4',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
 'OpenCV': '4.9.0',
 'PyTorch': '2.3.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n' ...
```
Reproduces the problem - code/configuration sample
The configs under configs/models all use:

```python
api_meta_template = dict(
    round=[
        dict(role="HUMAN", api_role="HUMAN"),
        dict(role="BOT", api_role="BOT", generate=True),
    ],
)

models = [
    dict(
        abbr="vanilla_llama-2-7b-chat_V1",
        # type=Llama2Chat,
        type=HuggingFaceCausalLM,
        path="xxx",
        tokenizer_path="xxx",
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
        ),
        meta_template=api_meta_template,
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        extract_pred_after_decode=True,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False,  # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
]
```
Reproduces the problem - command or script
```shell
python run.py --models llama --datasets triviaqa_gen
```
Reproduces the problem - error message
Running the triviaqa dataset with the version I downloaded around October last year gives a noticeably different result than running it with the current version.

The .py files dumped under outputs/.../configs are as follows:
**(1) triviaqa** — original version, score 45:

```python
datasets = [
    dict(abbr='triviaqa',
        eval_cfg=dict(
            evaluator=dict(type='opencompass.datasets.TriviaQAEvaluator'),
            pred_role='BOT'),
        infer_cfg=dict(
            inferencer=dict(
                max_out_len=50,
                type='opencompass.openicl.icl_inferencer.GenInferencer'),
            prompt_template=dict(
                template=dict(
                    round=[
                        dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?", role='HUMAN'),
                        dict(prompt='A:', role='BOT'),
                    ]),
                type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
            retriever=dict(
                type='opencompass.openicl.icl_retriever.ZeroRetriever')),
        path='./data/triviaqa/',
        reader_cfg=dict(
            input_columns=['question'],
            output_column='answer',
            test_split='dev',
            train_split='dev'),
        type='opencompass.datasets.TriviaQADataset'),
]
models = [
    dict(abbr='llama',
        batch_padding=False,
        batch_size=8,
        extract_pred_after_decode=True,
        max_out_len=100,
        max_seq_len=2048,
        meta_template=dict(
            round=[
                dict(api_role='HUMAN', role='HUMAN'),
                dict(api_role='BOT', generate=True, role='BOT'),
            ]),
        model_kwargs=dict(device_map='auto'),
        path='xx',
        run_cfg=dict(num_gpus=1, num_procs=1),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False),
        tokenizer_path='xx',
        type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer = None
work_dir = './outputs/default/20231012_212838'
```
**(2)** Current version, score 55:

```python
datasets = [
    dict(abbr='triviaqa',
        eval_cfg=dict(
            evaluator=dict(type='opencompass.datasets.TriviaQAEvaluator'),
            pred_role='BOT'),
        infer_cfg=dict(
            inferencer=dict(
                max_out_len=50,
                type='opencompass.openicl.icl_inferencer.GenInferencer'),
            prompt_template=dict(
                template=dict(
                    round=[
                        dict(prompt="Answer these questions, your answer should be as simple as possible, start your answer with the prompt 'The answer is '.\nQ: {question}?", role='HUMAN'),
                        dict(prompt='A:', role='BOT'),
                    ]),
                type='opencompass.openicl.icl_prompt_template.PromptTemplate'),
            retriever=dict(
                type='opencompass.openicl.icl_retriever.ZeroRetriever')),
        path='./data/triviaqa/',
        reader_cfg=dict(
            input_columns=['question'],
            output_column='answer',
            test_split='dev',
            train_split='dev'),
        type='opencompass.datasets.TriviaQADataset'),
]
models = [
    dict(abbr='llama',
        batch_padding=False,
        batch_size=8,
        extract_pred_after_decode=True,
        max_out_len=100,
        max_seq_len=2048,
        meta_template=dict(
            round=[
                dict(api_role='HUMAN', role='HUMAN'),
                dict(api_role='BOT', generate=True, role='BOT'),
            ]),
        model_kwargs=dict(device_map='auto'),
        path='xx',
        run_cfg=dict(num_gpus=1, num_procs=1),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False),
        tokenizer_path='xx',
        type='opencompass.models.HuggingFaceCausalLM'),
]
summarizer = dict(
    summary_groups=[
        dict(name='agieval-chinese',
            subsets=[
                'agieval-gaokao-chinese',
......
work_dir = './outputs/default/20240507_115016'
```
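A quick way to rule out config drift is to diff the two dumped config files. As a minimal sketch (the inline strings below are placeholders — substitute the full contents of the two `configs/*.py` files from each run's work_dir):

```python
import difflib

# Minimal excerpts standing in for the two dumped configs.
old_run = """\
datasets = [...]
models = [...]
summarizer = None
"""
new_run = """\
datasets = [...]
models = [...]
summarizer = dict(summary_groups=[...])
"""

# Produce a unified diff labelled with the two work_dir timestamps.
diff = list(difflib.unified_diff(
    old_run.splitlines(), new_run.splitlines(),
    fromfile="20231012_212838", tofile="20240507_115016", lineterm=""))
print("\n".join(diff))
```

On the configs above, the only lines the diff flags are the summarizer, so the dataset and model settings fed to the evaluator are identical between the two runs.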
Other information

What is causing the results to differ?
The evaluation code has become more lenient.
https://github.com/open-compass/opencompass/blob/19d7e630d6216550a56c8df572cce481c22f2ddc/opencompass/datasets/triviaqa.py#L88C1-L89C66
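To illustrate how a more lenient evaluator can raise the score without any change to the model or prompts, here is a minimal sketch of strict vs. lenient answer matching. This is an illustrative assumption, not the actual `TriviaQAEvaluator` code — the normalization rules and the substring check are hypothetical:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def strict_match(prediction: str, answers: list) -> bool:
    """Strict check: the normalized prediction must equal a gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(ans) for ans in answers)


def lenient_match(prediction: str, answers: list) -> bool:
    """Lenient check: a gold answer only has to appear inside the prediction."""
    pred = normalize(prediction)
    return any(normalize(ans) in pred for ans in answers)


pred = "The answer is Paris, the capital of France."
golds = ["Paris"]
print(strict_match(pred, golds))   # False: extra words around the answer
print(lenient_match(pred, golds))  # True: the answer appears as a substring
```

A prediction like the one above is wrong under the strict rule but correct under the lenient one, so switching rules alone can move the dataset score by several points.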