llm-foundry
Reproduce result of BoolQ on LLaMA-7B
Hi,
The zero-shot performance on BoolQ reported in the LLaMA paper is 76.5, but llm-foundry gives only 62.16 (zero-shot) when following tasks.yaml. Is the result in the blog few-shot? What is the zero-shot BoolQ result using llm-foundry, and where can I find the config to reproduce the result in the blog?
---update---
metrics/boolq/10-shot/InContextLearningMultipleChoiceAccuracy: 0.734413206577301
We found that the zero-shot performance of LLaMA on boolq was 0.767. Can you let me know how you produced this 0.62 number?
@bmosaicml Here is the yamls/hf_eval.yaml I used; I run WORLD_SIZE=8 composer eval.py yamls/hf_eval.yaml to evaluate.
max_seq_len: 2048
seed: 1
model_name_or_path: LLaMA-7B_hf/

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
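As a side note for anyone following along, here is a rough, self-contained sketch of how a single BoolQ row is expanded under icl_task_type: multiple_choice with continuation_delimiter: 'Answer: '. This is not the actual llm-foundry dataloader; the query/choices/gold field names and the exact string concatenation are assumptions made for illustration.

# Hypothetical illustration of the multiple_choice ICL setup used above.
# Field names (query/choices/gold) and the exact prompt formatting are
# assumptions, not copied from llm-foundry.
row = {
    "query": "Passage: ...\nQuestion: is the claim true?",
    "choices": ["no", "yes"],
    "gold": 1,
}
continuation_delimiter = "Answer: "  # from the yaml above

context = row["query"] + "\n" + continuation_delimiter
# One scored (context, continuation) pair per answer choice; the model's
# likelihood of each continuation given the shared context decides the answer.
candidates = [(context, " " + choice) for choice in row["choices"]]
for ctx, cont in candidates:
    print(repr(ctx + cont))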
@bmosaicml The datasets that reproduce successfully are:
metrics/piqa/5-shot/InContextLearningMultipleChoiceAccuracy: 0.800000011920929
metrics/lambada_openai/0-shot/InContextLearningLMAccuracy: 0.7379844784736633
winogrande/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7005
copa/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7788
The datasets that fail to reproduce are:
arc_easy/0-shot/InContextLearningMultipleChoiceAccuracy: 0.4242
arc_challenge/0-shot/InContextLearningMultipleChoiceAccuracy: 0.3579931855201721
copa/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7788
boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.6216381192207336
It may be because we used this model: https://huggingface.co/huggyllama/llama-7b
I will try to rerun with the model you linked and see how it performs
@bmosaicml, @vchiley
I can't reproduce the zero-shot results on boolq either, for both llama-7b and mpt-7b. My two yaml scripts are:
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  init_device: cpu
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
For mpt-7b:
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  init_device: meta
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 64

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
And these are run with the commands: python eval/eval.py eval/yamls/llama_7b_eval.yaml and python eval/eval.py eval/yamls/mpt_7b_eval.yaml
I get the following results for zero-shot boolq:
For llama-7b:
Ran eval in: 1823.6155874729156 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.49969419836997986
For mpt-7b:
Ran eval in: 225.67394590377808 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.40948012471199036
Any idea why this could be the case? Can you share the exact command and yaml file you used to generate the results in Table 1 here: https://www.mosaicml.com/blog/mpt-7b?
This issue should get reopened. I tried evaluating MMLU and got a score near 25% with everything. It seems like something in the InContextLearningMultipleChoice code is broken, but I'm not sure what yet. I'm currently debugging to figure out what is going on, and may create a simplified version of the multiple-choice evaluation to confirm there is an issue, because I want to evaluate some multiple-choice tasks. I even experimented with an opt-350m model (and several other models, which also didn't work) that scores at least 40% on some tasks, yet with this code it scores in the 25% range, which with 4 answer options is as good as random. So I don't think it has anything to do with the model; it seems to be a more general problem.
@ashim95 We've observed an issue when using FSDP on a single GPU. Could you please try either not using FSDP (comment out the FSDP section of the config) or running on multiple GPUs (with composer ...)?
@ianupright What exactly are you running?
@dakinggg I ran the mpt-7b model without FSDP with the following config:
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  init_device: cpu
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 64

# FSDP config for model sharding
# fsdp_config:
#   sharding_strategy: FULL_SHARD
#   mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
and I get the following results:
Eval metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.6211
Ran eval in: 427.96136689186096 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.621100902557373
The numbers are still not where they are supposed to be. Could you share the config file and commit id you used to report the numbers?
Thanks,
I'm running MMLU.
What I don't understand is this collate_fn in InContextLearningMultipleChoiceTaskDataset:
for choice in choices:
    context_enc = preamble['input_ids'] + context['input_ids']
    continuation_enc = choice['input_ids']
    inp, continuation_span = _make_padded_input(context_enc, continuation_enc,
                                                self.max_seq_len, self.pad_tok_id)
    inputs.append(inp)
    continuation_indices.append(continuation_span)
It loops through each choice (A, B, C, D) and creates a new input to be scored for each one; shouldn't it just do one prediction? It's also 4x slower to evaluate. I think the prediction failures may have something to do with this, but in any event, it would be good to have an evaluation that is simpler and more efficient.
@ashim95 Ah, you'll want to base off hf_eval.yaml, not mpt_eval.yaml, when loading the model from the Hugging Face Hub; mpt_eval.yaml is designed for loading the model from a Composer checkpoint. Sorry for the confusion, we are working on cleaning this up and will have a full eval script out soon. 0.62 is equivalent to the majority baseline for boolq, if I recall correctly :)
So, try
max_seq_len: 2048
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
@ianupright The MC dataset is paired with an MC metric which takes the most likely of the possible continuations as the predicted answer (https://github.com/mosaicml/composer/blob/6262846fbe4879979a2017d1c98993c05e082a4f/composer/metrics/nlp.py#L414). There are many different possible ways to do MC eval though. What model/command/yaml are you actually running?
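To make that concrete, here is a rough, self-contained sketch of the "most likely continuation wins" idea. It is an illustration of the scheme, not the actual InContextLearningMultipleChoiceAccuracy implementation (details such as whether and how the score is length-normalized are glossed over here).

# Rough sketch of multiple-choice scoring by continuation likelihood.
# Illustrative only; not the composer metric implementation.
def pick_choice(per_choice_token_nlls):
    """per_choice_token_nlls: one list per answer choice, holding the
    per-token negative log-likelihoods of that choice's continuation."""
    avg_nll = [sum(nlls) / len(nlls) for nlls in per_choice_token_nlls]
    return min(range(len(avg_nll)), key=lambda i: avg_nll[i])

# A 2-choice BoolQ-style item: the "yes" continuation (index 1) has lower
# loss, so it is predicted.
assert pick_choice([[2.3, 1.9], [0.7, 0.4]]) == 1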
Hello, thanks for your nice work. We can reproduce the scores on most datasets except winograd_wsc and winogrande. For llama-7b, the score on winograd_wsc is 0.8857142925262451 and the score on winogrande is 0.7004716992378235, a little different from the scores in the paper (winograd_wsc 0.807, winogrande 0.675). Can you have a try or give some advice?
The command is "WORLD_SIZE=8 composer eval/eval.py eval/yamls/hf_eval.yaml"
The config is
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: winogrande
  dataset_uri: eval/local_data/winogrande.jsonl
  num_fewshot: [0]
  icl_task_type: schema
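For context on the schema task type (it differs from multiple_choice): in Winograd-style scoring each example supplies several context options and one shared continuation, and the prediction is the context under which that continuation is most likely. A rough, illustrative sketch, with assumed field semantics rather than the exact llm-foundry jsonl schema:

# Hypothetical sketch of Winograd-style "schema" scoring. Illustrative only.
def pick_context(continuation_logprob_per_context):
    """One log-probability of the shared continuation per context option;
    the highest-likelihood context is the prediction."""
    return max(range(len(continuation_logprob_per_context)),
               key=lambda i: continuation_logprob_per_context[i])

# e.g. a 2-option winogrande item where context option 0 makes the shared
# continuation more likely, so option 0 is predicted.
assert pick_context([-3.1, -7.4]) == 0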
And metrics/jeopardy/0-shot/InContextLearningLMAccuracy: 0.36367926001548767, a little different from the 0.334 in the picture for llama-7b.
The command is "WORLD_SIZE=8 composer eval/eval.py eval/yamls/hf_eval.yaml"
The config is
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: jeopardy
  dataset_uri: eval/local_data/jeopardy_all.jsonl
  num_fewshot: [0]
  icl_task_type: language_modeling
  continuation_delimiter: 'Answer: '
  has_categories: False
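For reference, the language_modeling task type behind the jeopardy number is scored differently from the multiple-choice tasks: roughly, an example counts as correct only if the model's greedy prediction reproduces the gold continuation token-for-token. A simplified, illustrative sketch (not the composer implementation):

# Simplified sketch of language-modeling ICL accuracy. Illustrative only.
def lm_correct(greedy_tokens, gold_tokens):
    """Correct only if every token of the gold continuation is matched."""
    return len(greedy_tokens) == len(gold_tokens) and all(
        g == t for g, t in zip(greedy_tokens, gold_tokens)
    )

assert lm_correct([464, 6182], [464, 6182])       # exact match -> correct
assert not lm_correct([464, 9999], [464, 6182])   # any mismatch -> incorrect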
Hi @dyy401453043, those three numbers (llama-7b on winograd, winogrande, and jeopardy) are a mistake in our table, and the numbers you stated are correct. We will update the table shortly, and will soon have a full script for reproducing all of the eval scores that we report. Thank you for helping to check our work!