How to reproduce zero-shot evals from Table 1 in the blog?
Hi, I am trying to reproduce your zero-shot evals from Table 1 in the blog: https://www.mosaicml.com/blog/mpt-7b but the numbers I am getting are much worse than those reported there.
The command I am running is: composer eval/eval.py repro_table1.yaml where repro_table1.yaml looks like this:
max_seq_len: 2048
tokenizer_name: EleutherAI/gpt-neox-20b
seed: 1
precision: amp_bf16

# Tokenizer
tokenizer:
  name: ${tokenizer_name}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  attn_config:
    attn_impl: torch

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: piqa
  dataset_uri: eval/local_data/piqa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: '
-
  label: lambada_openai
  dataset_uri: eval/local_data/lambada_openai.jsonl
  num_fewshot: [0]
  icl_task_type: language_modeling
For simplicity I've included only piqa and lambada_openai in the config above. The zero-shot accuracies I get with this config are: piqa = 0.497 vs 0.799 (blog), and lambada_openai = 0.6848 vs 0.703 (blog).
Any idea what I'm doing wrong here?
https://github.com/mosaicml/llm-foundry/issues/59 and https://github.com/mosaicml/llm-foundry/issues/88 discuss eval results. Let us know if that answers your questions.
Yes, with name: hf_causal_lm I've been able to reproduce most of the results in the blog (modulo some small diffs), except for Jeopardy and MMLU, where the gaps are somewhat larger: about 3 points lower average score on Jeopardy (blog = 0.308 vs mine = 0.27978) and about 0.5 points lower average score on MMLU (blog = 0.296 vs mine = 0.29146).
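For reference, the model block I switched to looks roughly like this (everything else carried over from the config above; treat this as a sketch rather than the exact file):

    model:
      # swapped-in HF loader; tokenizer, fsdp_config, and icl_tasks unchanged
      name: hf_causal_lm
      pretrained: true
      pretrained_model_name_or_path: mosaicml/mpt-7b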
These two tasks contain multiple sub-tasks; how are you aggregating their results? A plain average, or perhaps an average weighted by sub-task size?
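For concreteness, these are the two schemes I have in mind; the sub-task names and counts below are made up purely to illustrate how much the choice can matter:

    # Hypothetical sub-task results: (accuracy, number of questions).
    # Names and numbers are illustrative only, not the blog's data.
    subtask_results = {
        "world_history": (0.35, 500),
        "science":       (0.25, 2000),
        "literature":    (0.30, 800),
    }

    # Unweighted: every sub-task counts equally.
    simple_avg = sum(acc for acc, _ in subtask_results.values()) / len(subtask_results)

    # Size-weighted: larger sub-tasks dominate the aggregate.
    total_n = sum(n for _, n in subtask_results.values())
    weighted_avg = sum(acc * n for acc, n in subtask_results.values()) / total_n

    print(f"simple:   {simple_avg:.3f}")    # 0.300
    print(f"weighted: {weighted_avg:.3f}")  # 0.277

With skewed sub-task sizes like these, the two aggregates differ by a couple of points, which is on the order of the Jeopardy gap I'm seeing.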