
Evaluation result mismatch

Open congyingxia opened this issue 1 year ago • 7 comments

I tried to run the 0-shot evaluation on winograd for the MPT-7b model. This is the result that I got: Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.5055

This is the script that I use:

python eval/eval.py \
  eval/yamls/hf_eval.yaml \
  icl_tasks=eval/yamls/winograd.yaml \
  model_name_or_path=mosaicml/mpt-7b

The reported number for 0-shot winograd for MPT-7b is 0.878. Is there anything missing here?


congyingxia avatar May 10 '23 02:05 congyingxia

I just had success running composer eval/eval.py YAML_NAME.yaml

with the following YAML:

seed: 1
  max_seq_len: 1024
  device_eval_batch_size: 4

  fsdp_config:
    mixed_precision: PURE
    sharding_strategy: FULL_SHARD
  icl_tasks:
    -
      label: winograd
      dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
      num_fewshot: [0, 1, 5, 10]
      icl_task_type: schema

  model:
      device: cpu
      name: hf_causal_lm
      pretrained: true
      pretrained_model_name_or_path: mosaicml/mpt-7b
      use_auth_token: false
  
  tokenizer:
      kwargs:
        model_max_length: ${max_seq_len}
      name: mosaicml/mpt-7b

Can you try this and let me know how it goes?

bmosaicml avatar May 10 '23 03:05 bmosaicml

Thanks a lot, I get similar results now:

Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.8679

The config has some indentation issues; a fixed version is provided here:

seed: 1
max_seq_len: 1024
device_eval_batch_size: 4

fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD

icl_tasks:
-
  label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: schema

model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  use_auth_token: false

tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

congyingxia avatar May 10 '23 03:05 congyingxia

Do you have any idea why there is such a large difference between the results obtained from running 'python eval/eval.py' and 'composer eval/eval.py'?

congyingxia avatar May 10 '23 03:05 congyingxia

What setup are you using? Is it multi-GPU? 'composer SCRIPT' will launch the script across multiple GPUs; 'python SCRIPT' will not, but if there are multiple GPUs in your setup the script will assume you have launched on multiple GPUs and something might break. If you are using a multi-GPU setup, that might be part of the difference. (I'm not sure, I'm just throwing out ideas.)

vchiley avatar May 10 '23 04:05 vchiley

I tried both multi-GPU and single-GPU for "python eval.py". The issue is the same, so it should not be due to the multi-GPU setup.

congyingxia avatar May 10 '23 05:05 congyingxia

Hi @congyingxia, in order to do evaluation in a data parallel manner, the dataset size is padded to be divisible by the world size, which results in duplicating a small number of samples. For most datasets, the difference from this should be very small, because the number of duplicated samples is very small relative to the full dataset size. Winograd is a very small dataset and so may see a larger difference. We ran all of our evals on 32 gpus for fair comparison between models, and are also fixing this duplication issue in composer (https://github.com/mosaicml/composer/pull/2218). cc @bmosaicml
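To make the effect concrete, here is a minimal sketch of the padding arithmetic described above (not llm-foundry's actual code). It assumes the standard Winograd Schema Challenge size of 273 examples; the exact count in winograd_wsc.jsonl may differ.

```python
# Sketch of padding a dataset so it divides evenly across GPUs.
# Assumption: Winograd has 273 examples (the classic WSC size).
import math

def padded_size(n_samples: int, world_size: int) -> int:
    """Round the dataset size up to the nearest multiple of world_size."""
    return math.ceil(n_samples / world_size) * world_size

n = 273          # assumed Winograd dataset size
world_size = 32  # number of GPUs used for the reported evals

padded = padded_size(n, world_size)
duplicates = padded - n
print(padded, duplicates)            # 288 15
print(f"{duplicates / padded:.1%}")  # 5.2% of evaluated samples are duplicates
```

On a single GPU no padding is needed (K = 0), which is why small datasets like Winograd can score noticeably differently across world sizes while large datasets barely move.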

dakinggg avatar May 12 '23 17:05 dakinggg

Thanks for your clarification. Is this the reason for the difference between running 'python eval/eval.py' and 'composer eval/eval.py'? python eval/eval.py is using the exact samples in the dataset while composer eval/eval.py will duplicate samples to pad the dataset size?

congyingxia avatar May 16 '23 19:05 congyingxia

@congyingxia Yes that is my understanding. In the current v0.1.1 codebase, prior to the fixes coming in https://github.com/mosaicml/composer/pull/2218, using multiple GPUs for eval with any dataset (streaming or HF or a standard map-style dataset), will duplicate 0 <= K < WORLD_SIZE samples to maintain the same number of samples per-GPU.

abhi-mosaic avatar May 17 '23 22:05 abhi-mosaic

Got it, thanks!

congyingxia avatar May 18 '23 17:05 congyingxia