llm-foundry
Reproduce result of BoolQ on LLaMA-7B
Hi,
The zero-shot performance on BoolQ reported in the LLaMA paper is 76.5, but llm-foundry gives only 62.16 (zero-shot) when following tasks.yaml. Is the result in the blog few-shot? What is the zero-shot BoolQ result using llm-foundry, and where can I find the config to reproduce the result in the blog?
---update---
metrics/boolq/10-shot/InContextLearningMultipleChoiceAccuracy: 0.734413206577301
We found that the zero-shot performance of LLaMA on boolq was 0.767. Can you let me know how you produced this 0.62 number?
@bmosaicml Here is the yamls/hf_eval.yaml I used; I run WORLD_SIZE=8 composer eval.py yamls/hf_eval.yaml to evaluate.
max_seq_len: 2048
seed: 1
model_name_or_path: LLaMA-7B_hf/

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
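As a side note for anyone following along, here is a rough, self-contained sketch of how a single BoolQ row is expanded under icl_task_type: multiple_choice with continuation_delimiter: 'Answer: '. This is not the actual llm-foundry dataloader; the query/choices/gold field names and the exact string concatenation are assumptions made for illustration.

# Hypothetical illustration of the multiple_choice ICL setup used above.
# Field names (query/choices/gold) and the exact prompt formatting are
# assumptions, not copied from llm-foundry.
row = {
    "query": "Passage: ...\nQuestion: is the claim true?",
    "choices": ["no", "yes"],
    "gold": 1,
}
continuation_delimiter = "Answer: "  # from the yaml above

context = row["query"] + "\n" + continuation_delimiter
# One scored (context, continuation) pair per answer choice; the model's
# likelihood of each continuation given the shared context decides the answer.
candidates = [(context, " " + choice) for choice in row["choices"]]
for ctx, cont in candidates:
    print(repr(ctx + cont))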
@bmosaicml The datasets that reproduce successfully are:
metrics/piqa/5-shot/InContextLearningMultipleChoiceAccuracy: 0.800000011920929
metrics/lambada_openai/0-shot/InContextLearningLMAccuracy: 0.7379844784736633
winogrande/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7005
copa/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7788
The datasets that fail to reproduce are:
arc_easy/0-shot/InContextLearningMultipleChoiceAccuracy: 0.4242
arc_challenge/0-shot/InContextLearningMultipleChoiceAccuracy: 0.3579931855201721
copa/0-shot/InContextLearningMultipleChoiceAccuracy: 0.7788
boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.6216381192207336
It may be because we used this model: https://huggingface.co/huggyllama/llama-7b
I will try to rerun with the model you linked and see how it performs
@bmosaicml, @vchiley
I can't reproduce the zero-shot results on boolq either, for both llama-7b and mpt-7b. My two yaml scripts are:
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  init_device: cpu
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
For mpt-7b:
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  init_device: meta
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 64

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
And these are run with the commands: python eval/eval.py eval/yamls/llama_7b_eval.yaml and python eval/eval.py eval/yamls/mpt_7b_eval.yaml
I get the following results for zero-shot boolq:
For llama-7b:
Ran eval in: 1823.6155874729156 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.49969419836997986
For mpt-7b:
Ran eval in: 225.67394590377808 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.40948012471199036
Any idea why this could be the case? Can you share the exact command and yaml file you used to generate the results in Table 1 here: https://www.mosaicml.com/blog/mpt-7b?
This issue should get reopened. I tried evaluating MMLU and got a score near 25% with everything. It seems like something in the InContextLearningMultipleChoice code is broken, but I'm not sure what yet. I'm currently debugging to figure out what is going on, and may create a simplified version of the multiple-choice evaluation to confirm there is an issue, because I want to evaluate some multiple-choice tasks. I even experimented with an opt-350m model (and several other models, which also didn't work) that scores at least 40% on some tasks, yet with this code it scores in the 25% range, which with 4 answer options is as good as random. So I don't think it has anything to do with the model; it seems to be a more general problem.
@ashim95 We've observed an issue when using FSDP on a single GPU. Could you please try either not using FSDP (comment out the FSDP section of the config) or running on multiple GPUs (with composer ...)?
@ianupright What exactly are you running?
@dakinggg I ran the mpt-7b model without FSDP with the following config:
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: mpt_causal_lm
  init_device: cpu
  pretrained: true
  pretrained_model_name_or_path: ${model_name_or_path}
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 64

# FSDP config for model sharding
# fsdp_config:
#   sharding_strategy: FULL_SHARD
#   mixed_precision: PURE

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
and I get the following results:
Eval metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.6211
Ran eval in: 427.96136689186096 seconds
metrics/boolq/0-shot/InContextLearningMultipleChoiceAccuracy: 0.621100902557373
The numbers are still not where they are supposed to be. Could you share the config file and commit id you used to report the numbers?
Thanks,
I'm running MMLU.
What I don't understand is this collate_fn in InContextLearningMultipleChoiceTaskDataset:
for choice in choices:
    context_enc = preamble['input_ids'] + context['input_ids']
    continuation_enc = choice['input_ids']
    inp, continuation_span = _make_padded_input(context_enc, continuation_enc,
                                                self.max_seq_len, self.pad_tok_id)
    inputs.append(inp)
    continuation_indices.append(continuation_span)
It loops through each choice (A, B, C, D) and creates a new input to be scored for each one; shouldn't it just do one prediction? It's also 4x slower to evaluate. I think the prediction failures may have something to do with this, but in any event, it would be good to have an evaluation that is simpler and more efficient.
@ashim95 Ah, you'll want to base off hf_eval.yaml, not mpt_eval.yaml, when loading the model from the Hugging Face Hub; mpt_eval.yaml is designed for loading the model from a Composer checkpoint. Sorry for the confusion, we are working on cleaning this up and will have a full eval script out soon. 0.62 is equivalent to the majority baseline for boolq, if I recall correctly :)
So, try
max_seq_len: 2048
seed: 1
model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: boolq
  dataset_uri: eval/local_data/boolq.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: multiple_choice
  continuation_delimiter: 'Answer: ' # this separates questions from answers
@ianupright The MC dataset is paired with an MC metric which takes the most likely of the possible continuations as the predicted answer (https://github.com/mosaicml/composer/blob/6262846fbe4879979a2017d1c98993c05e082a4f/composer/metrics/nlp.py#L414). There are many different possible ways to do MC eval though. What model/command/yaml are you actually running?
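To make that concrete, here is a rough, self-contained sketch of the "most likely continuation wins" idea. It is an illustration of the scheme, not the actual InContextLearningMultipleChoiceAccuracy implementation (details such as whether and how the score is length-normalized are glossed over here).

# Rough sketch of multiple-choice scoring by continuation likelihood.
# Illustrative only; not the composer metric implementation.
def pick_choice(per_choice_token_nlls):
    """per_choice_token_nlls: one list per answer choice, holding the
    per-token negative log-likelihoods of that choice's continuation."""
    avg_nll = [sum(nlls) / len(nlls) for nlls in per_choice_token_nlls]
    return min(range(len(avg_nll)), key=lambda i: avg_nll[i])

# A 2-choice BoolQ-style item: the "yes" continuation (index 1) has lower
# loss, so it is predicted.
assert pick_choice([[2.3, 1.9], [0.7, 0.4]]) == 1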
Hello, thanks for your nice work. We can reproduce the scores on most datasets except winograd_wsc and winogrande. For llama-7b, the score on winograd_wsc is 0.8857142925262451 and the score on winogrande is 0.7004716992378235, a little different from the scores in the paper (winograd_wsc 0.807, winogrande 0.675). Can you have a try or give some advice?
The command is "WORLD_SIZE=8 composer eval/eval.py eval/yamls/hf_eval.yaml"
The config is
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: winogrande
  dataset_uri: eval/local_data/winogrande.jsonl
  num_fewshot: [0]
  icl_task_type: schema
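For context on the schema task type (it differs from multiple_choice): in Winograd-style scoring each example supplies several context options and one shared continuation, and the prediction is the context under which that continuation is most likely. A rough, illustrative sketch, with assumed field semantics rather than the exact llm-foundry jsonl schema:

# Hypothetical sketch of Winograd-style "schema" scoring. Illustrative only.
def pick_context(continuation_logprob_per_context):
    """One log-probability of the shared continuation per context option;
    the highest-likelihood context is the prediction."""
    return max(range(len(continuation_logprob_per_context)),
               key=lambda i: continuation_logprob_per_context[i])

# e.g. a 2-option winogrande item where context option 0 makes the shared
# continuation more likely, so option 0 is predicted.
assert pick_context([-3.1, -7.4]) == 0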
And metrics/jeopardy/0-shot/InContextLearningLMAccuracy: 0.36367926001548767, a little different from the 0.334 in the picture for llama-7b.
The command is "WORLD_SIZE=8 composer eval/eval.py eval/yamls/hf_eval.yaml"
The config is
max_seq_len: 2048
seed: 1
model_name_or_path: huggyllama/llama-7b

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: ${model_name_or_path}
  init_device: cpu
  pretrained: true

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 16

# FSDP config for model sharding
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE

icl_tasks:
-
  label: jeopardy
  dataset_uri: eval/local_data/jeopardy_all.jsonl
  num_fewshot: [0]
  icl_task_type: language_modeling
  continuation_delimiter: 'Answer: '
  has_categories: False
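For reference, the language_modeling task type behind the jeopardy number is scored differently from the multiple-choice tasks: roughly, an example counts as correct only if the model's greedy prediction reproduces the gold continuation token-for-token. A simplified, illustrative sketch (not the composer implementation):

# Simplified sketch of language-modeling ICL accuracy. Illustrative only.
def lm_correct(greedy_tokens, gold_tokens):
    """Correct only if every token of the gold continuation is matched."""
    return len(greedy_tokens) == len(gold_tokens) and all(
        g == t for g, t in zip(greedy_tokens, gold_tokens)
    )

assert lm_correct([464, 6182], [464, 6182])       # exact match -> correct
assert not lm_correct([464, 9999], [464, 6182])   # any mismatch -> incorrect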
Hi @dyy401453043, those three numbers (llama-7b on winograd, winogrande, and jeopardy) are a mistake in our table, and the numbers you stated are correct. We will update the table shortly, and will soon have a full script for reproducing all of the eval scores that we report. Thank you for helping to check our work!