
How to reproduce zero-shot evals from Table 1 in the blog?

eldarkurtic opened this issue 2 years ago • 2 comments

Hi, I am trying to reproduce your zero-shot evals from Table 1 in the blog: https://www.mosaicml.com/blog/mpt-7b, but the numbers I am seeing are much worse than the ones reported there.

The command I am running is composer eval/eval.py repro_table1.yaml, where repro_table1.yaml looks like this:

max_seq_len: 2048
tokenizer_name: EleutherAI/gpt-neox-20b
seed: 1
precision: amp_bf16

# Tokenizer
tokenizer:
  name: ${tokenizer_name}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  attn_config:
    attn_impl: torch

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: piqa
  dataset_uri: eval/local_data/piqa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '
-
  label: lambada_openai
  dataset_uri: eval/local_data/lambada_openai.jsonl
  num_fewshot: [0]
  icl_task_type: language_modeling

For simplicity, I've included only piqa and lambada_openai in the config above. The zero-shot accuracies I get with this config are: piqa = 0.497 vs 0.799 (blog) and lambada = 0.6848 vs 0.703 (blog).

Any idea what I'm doing wrong here?

eldarkurtic avatar May 27 '23 13:05 eldarkurtic

https://github.com/mosaicml/llm-foundry/issues/59 and https://github.com/mosaicml/llm-foundry/issues/88 discuss eval results. Let us know if that answers your questions.

vchiley avatar May 27 '23 19:05 vchiley

Yes, with name: hf_causal_lm (model section sketched below) I've been able to reproduce most of the results in the blog (modulo some small diffs), except for Jeopardy and MMLU, where I see somewhat larger gaps: about 3 points lower average score on Jeopardy (blog=0.308 vs mine=0.27978) and about 0.5 points lower average score on MMLU (blog=0.296 vs mine=0.29146).
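
For reference, the only change I made relative to the config in my first post was the model section, roughly as follows (a minimal sketch that reuses only the keys already shown above; I dropped the attn_config override, so treat the exact key layout as my assumption rather than the canonical hf_causal_lm schema):

model:
  name: hf_causal_lm                              # load through the HF wrapper instead of mpt_causal_lm
  pretrained: true                                # use pretrained weights rather than a random init
  pretrained_model_name_or_path: mosaicml/mpt-7b  # same checkpoint as before

Everything else (tokenizer, fsdp_config, icl_tasks) stayed the same as in my original yaml.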

These two tasks contain multiple sub-tasks; how are you aggregating their results? A simple average, or perhaps a weighted average based on sub-task sizes?

eldarkurtic avatar May 28 '23 13:05 eldarkurtic