lm-evaluation-harness
Always get acc, acc_norm, perplexity = 1 on the triviaqa task with the llama2 model
I use the following command to run the triviaqa task:
lm_eval --model hf \
    --model_args pretrained=../llama/models_hf/7B \
    --tasks triviaqa \
    --num_fewshot 1 \
    --device cuda:2 \
    --batch_size 8
I just get acc_norm = 1, and it's the same when I use the acc or perplexity metric.
Hi! Could you provide the YAML file and codebase commit you are using to evaluate triviaqa?
This output seems quite strange given that triviaqa uses none of these metrics in its config.
I can't seem to replicate your result when I run triviaqa locally on gpt2.
Hi, thank you for your response. With the exact_match metric, I get a value of 0.07.
Here is the YAML I used; I only changed the metric.
task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 2.0
For the code, I cloned the repository with the following command and changed nothing except the YAML file above:
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
Those metrics you used are currently only supported for loglikelihood or multiple_choice output_type tasks. In your case, this could typically be achieved by setting output_type: loglikelihood, but that is complicated by triviaqa's use of multiple gold-standard answers.
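For a generate_until task like triviaqa, the safest fix is to keep the generation-compatible exact_match metric rather than acc. As a sketch based on the config you pasted (only the metric name reverted, everything else unchanged), the metric block would look like:

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true

With that in place, re-running the same lm_eval command should report the exact_match score you already observed (around 0.07 in your run) instead of a constant 1.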
We'll make sure that running this errors out explicitly in the future to avoid confusion.
We are also working on making metrics easier to understand and to add to the library in general!
Thanks a lot.