
Always get acc, acc_norm, perplexity = 1 on triviaqa task with the llama2 model

Open learner-crapy opened this issue 1 year ago • 4 comments

I use the following command to run the triviaqa task:

lm_eval --model hf \
    --model_args pretrained=../llama/models_hf/7B \
    --tasks triviaqa \
    --num_fewshot 1 \
    --device cuda:2 \
    --batch_size 8

I just get acc_norm=1, and it's the same when I use the acc or perplexity metric.


learner-crapy avatar Jan 03 '24 01:01 learner-crapy

Hi! Could you provide the YAML file and codebase commit you are using to evaluate triviaqa?

This output seems quite strange given that triviaqa uses none of these metrics in its config.

I can't seem to replicate your result when I run triviaqa locally on gpt2.

haileyschoelkopf avatar Jan 03 '24 16:01 haileyschoelkopf

Hi, thank you for your response. With the exact_match metric, I get a value of 0.07.

Here is the YAML I used; I only changed the metric.

task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 2.0

For the code, I used the following command and changed nothing except the above YAML file.

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
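(For reference, the standard editable install from the cloned repo, as described in the README, is roughly:)

cd lm-evaluation-harness
pip install -e .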

learner-crapy avatar Jan 03 '24 17:01 learner-crapy

Those metrics you used are currently only supported for loglikelihood or multiple_choice output_type tasks. In your case, this could typically be achieved by setting output_type: loglikelihood, but that is complicated by triviaqa's use of multiple gold-standard answers.
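For reference, a generate_until task like triviaqa is normally scored with exact_match in its metric_list, along the lines of the following sketch (check the shipped config in the repo for the exact settings):

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true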

We’ll make sure that running this errors out explicitly in the future to avoid confusion.

We are also working on making metrics easier to understand and to add throughout the library!

haileyschoelkopf avatar Jan 03 '24 18:01 haileyschoelkopf

Thanks a lot.

learner-crapy avatar Jan 03 '24 18:01 learner-crapy