[Bug] Failed to reproduce llama2-70b-base on triviaqa
Prerequisites
Issue type
I am evaluating with an officially supported task/model/dataset.
Environment
Other tasks run normally.
Reproducing the problem - code/config example
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import TriviaQADataset, TriviaQAEvaluator

triviaqa_reader_cfg = dict(
    input_columns=['question'],
    output_column='answer',
    train_split='dev',
    test_split='dev')

triviaqa_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
        ])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=50))

triviaqa_eval_cfg = dict(
    evaluator=dict(type=TriviaQAEvaluator),
    pred_role='BOT')

triviaqa_datasets = [
    dict(
        type=TriviaQADataset,
        abbr='triviaqa',
        path='./data/triviaqa/',
        reader_cfg=triviaqa_reader_cfg,
        infer_cfg=triviaqa_infer_cfg,
        eval_cfg=triviaqa_eval_cfg)
]
Reproducing the problem - command or script
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py configs/eval_llama2_70b_turbomind.py -w outputs/llama2-70b
Reproducing the problem - error message
On the triviaqa dataset, my reproduced result for llama2-70b-base is 3.8 (the leaderboard reports 70.7).
Other information
Is there a problem with the config file?
Could you provide the files with the model outputs?
How about evaluating llama2-70b-base without turbomind?
The result using the HuggingFace generate function is the same as llama2-70b-base with turbomind (3.8).
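For reference, one way to check whether the low score comes from the raw generations (rather than from answer matching) is to print a few zero-shot completions directly with the HuggingFace generate API. A minimal sketch, assuming a local checkpoint path and a made-up sample question:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local checkpoint path; substitute your own.
model_path = './llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto')

# Same zero-shot prompt shape as in the config above.
prompt = 'Question: Who wrote the novel "1984"?\nAnswer: '
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

# Greedy decoding, capped at 50 new tokens to mirror max_out_len=50.
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Decode only the continuation, not the echoed prompt.
completion = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:],
                              skip_special_tokens=True)
print(completion)

If the model produces the right answer followed by extra rambling text, the problem is more likely the prompt format or answer extraction than the weights themselves.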
It is similar for llama2-7b. (BTW, I'm using a local llama2-7b-hf while comparing against the Llama-2-7B on the leaderboard.)
leaderboard: 52.8 (triviaqa_gen_3e39a5)
my results: 2.15 (triviaqa_gen_3e39a5), 52.44 (triviaqa_gen_0356ec)
my model outputs: llama-2-7b-hf.zip
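Given that the same model scores 52.44 under triviaqa_gen_0356ec, the gap appears to come from zero-shot vs. few-shot prompting of the base model. Below is a hedged sketch of the few-shot variant, using FixKRetriever to prepend a fixed set of in-context examples; whether this exactly matches the prompts in triviaqa_gen_0356ec is an assumption:

from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

triviaqa_infer_cfg = dict(
    # Template for each in-context example, filled from the train split.
    ice_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role='HUMAN', prompt='Q: {question}'),
            dict(role='BOT', prompt='A: {answer}'),
        ])),
    # Template for the test question; the ice_token marks where the
    # retrieved examples are spliced in.
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin='</E>',
            round=[
                dict(role='HUMAN', prompt='Q: {question}'),
                dict(role='BOT', prompt='A:'),
            ]),
        ice_token='</E>'),
    # Always retrieve the same five examples as the few-shot context.
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
    inferencer=dict(type=GenInferencer, max_out_len=50))

With in-context examples, a base model is far more likely to answer in the short "A: <answer>" form the evaluator expects, instead of drifting after a bare "Answer:" prompt.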
Fixed in the latest version; feel free to try it and re-open this issue if needed.