
Evaluation result mismatch

Open congyingxia opened this issue 1 year ago • 7 comments

I tried to run the 0-shot evaluation on winograd for the MPT-7b model. This is the result that I got: Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.5055

This is the script that I use:

python eval/eval.py \
  eval/yamls/hf_eval.yaml \
  icl_tasks=eval/yamls/winograd.yaml \
  model_name_or_path=mosaicml/mpt-7b

The reported number for 0-shot winograd for MPT-7b is 0.878. Is there anything missing here?


congyingxia avatar May 10 '23 02:05 congyingxia

I just had success running composer eval/eval.py YAML_NAME.yaml

with the following YAML:

seed: 1
  max_seq_len: 1024
  device_eval_batch_size: 4

  fsdp_config:
    mixed_precision: PURE
    sharding_strategy: FULL_SHARD
  icl_tasks:
    -
      label: winograd
      dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
      num_fewshot: [0, 1, 5, 10]
      icl_task_type: schema

  model:
      device: cpu
      name: hf_causal_lm
      pretrained: true
      pretrained_model_name_or_path: mosaicml/mpt-7b
      use_auth_token: false
  
  tokenizer:
      kwargs:
        model_max_length: ${max_seq_len}
      name: mosaicml/mpt-7b

Can you try this and let me know how it goes?

bmosaicml avatar May 10 '23 03:05 bmosaicml

Thanks a lot, I get similar results now:

Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.8679

The config has some indentation issues; a fixed version is provided here:

seed: 1
max_seq_len: 1024
device_eval_batch_size: 4

fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD

icl_tasks:
-
  label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: schema

model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  use_auth_token: false

tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

congyingxia avatar May 10 '23 03:05 congyingxia

Do you have any idea why there is such a large difference between the results obtained from running 'python eval/eval.py' and 'composer eval/eval.py'?

congyingxia avatar May 10 '23 03:05 congyingxia

What setup are you using? Is it multi-GPU? 'composer SCRIPT' will launch the script across multiple GPUs; 'python SCRIPT' will not, but if there are multiple GPUs in your setup the script will assume you have launched on multiple GPUs and something might break. If you are using a multi-GPU setup, that might be part of the difference. (I'm not sure, I'm just throwing out ideas.)

vchiley avatar May 10 '23 04:05 vchiley

I tried both multi-GPU and single-GPU for "python eval.py". The issue is the same, so it should not be due to the multi-GPU setup.

congyingxia avatar May 10 '23 05:05 congyingxia

Hi @congyingxia, in order to do evaluation in a data parallel manner, the dataset size is padded to be divisible by the world size, which results in duplicating a small number of samples. For most datasets, the difference from this should be very small, because the number of duplicated samples is very small relative to the full dataset size. Winograd is a very small dataset and so may see a larger difference. We ran all of our evals on 32 gpus for fair comparison between models, and are also fixing this duplication issue in composer (https://github.com/mosaicml/composer/pull/2218). cc @bmosaicml
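To make the effect concrete, here is a minimal sketch of the padding arithmetic described above (not llm-foundry's actual code). It assumes the standard Winograd Schema Challenge size of 273 examples; the exact count in winograd_wsc.jsonl may differ.

```python
# Sketch of padding a dataset so it divides evenly across GPUs.
# Assumption: Winograd has 273 examples (the classic WSC size).
import math

def padded_size(n_samples: int, world_size: int) -> int:
    """Round the dataset size up to the nearest multiple of world_size."""
    return math.ceil(n_samples / world_size) * world_size

n = 273          # assumed Winograd dataset size
world_size = 32  # number of GPUs used for the reported evals

padded = padded_size(n, world_size)
duplicates = padded - n
print(padded, duplicates)            # 288 15
print(f"{duplicates / padded:.1%}")  # 5.2% of evaluated samples are duplicates
```

On a single GPU no padding is needed (K = 0), which is why small datasets like Winograd can score noticeably differently across world sizes while large datasets barely move.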

dakinggg avatar May 12 '23 17:05 dakinggg

Thanks for your clarification. Is this the reason for the difference between running 'python eval/eval.py' and 'composer eval/eval.py'? python eval/eval.py is using the exact samples in the dataset while composer eval/eval.py will duplicate samples to pad the dataset size?

congyingxia avatar May 16 '23 19:05 congyingxia

@congyingxia Yes that is my understanding. In the current v0.1.1 codebase, prior to the fixes coming in https://github.com/mosaicml/composer/pull/2218, using multiple GPUs for eval with any dataset (streaming or HF or a standard map-style dataset), will duplicate 0 <= K < WORLD_SIZE samples to maintain the same number of samples per-GPU.

abhi-mosaic avatar May 17 '23 22:05 abhi-mosaic

Got it, thanks!

congyingxia avatar May 18 '23 17:05 congyingxia