llm-foundry
Evaluation result mismatch
I tried to run the 0-shot Winograd evaluation for the MPT-7B model. This is the result I got:
Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.5055
This is the script that I use:
python eval/eval.py \
  eval/yamls/hf_eval.yaml \
  icl_tasks=eval/yamls/winograd.yaml \
  model_name_or_path=mosaicml/mpt-7b
The reported number for 0-shot Winograd for MPT-7B is 0.878. Is there anything I'm missing here?
I just had success running composer eval/eval.py YAML_NAME.yaml
with the following YAML:
seed: 1
max_seq_len: 1024
device_eval_batch_size: 4
fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD
icl_tasks:
- label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0, 1, 5, 10]
  icl_task_type: schema
model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  use_auth_token: false
tokenizer:
  kwargs:
    model_max_length: ${max_seq_len}
  name: mosaicml/mpt-7b
Can you try this and let me know how it goes?
Thanks a lot! I can get similar results now:
Eval metrics/winograd/0-shot/InContextLearningMultipleChoiceAccuracy: 0.8679
The config above had some indentation issues; a fixed version is provided here:
seed: 1
max_seq_len: 1024
device_eval_batch_size: 4
fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD
icl_tasks:
- label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [0]
  icl_task_type: schema
model:
  device: cpu
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  use_auth_token: false
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}
Do you have any idea why there is such a large difference between the results obtained from running 'python eval/eval.py' and 'composer eval/eval.py'?
What setup are you using? Is it multi-GPU?
composer SCRIPT will launch the script across multiple GPUs; python SCRIPT will not, but if there are multiple GPUs in your setup, the script will assume you have launched on multiple GPUs and something might break.
If you are using a multi-GPU setup, that might be part of the difference.
(I'm not sure, I'm just throwing out ideas.)
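For what it's worth, here's a minimal sketch of how you can check which mode a process is running in, assuming the standard torch.distributed environment variables (the composer launcher sets these for each rank; a bare python invocation does not):

import os

# Read the torch.distributed launch variables; default to single-process values.
world_size = int(os.environ.get("WORLD_SIZE", 1))
rank = int(os.environ.get("RANK", 0))
print(f"rank {rank} of world size {world_size}")

# Under `composer eval/eval.py ...` on an 8-GPU machine this prints one line
# per rank with world_size == 8; under `python eval/eval.py ...` it prints a
# single line with world_size == 1, even on a multi-GPU machine.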
I tried both multi-GPU and single-GPU for "python eval.py". The issue is the same, so it should not be due to the multi-GPU setup.
Hi @congyingxia, in order to do evaluation in a data parallel manner, the dataset size is padded to be divisible by the world size, which results in duplicating a small number of samples. For most datasets, the difference from this should be very small, because the number of duplicated samples is very small relative to the full dataset size. Winograd is a very small dataset and so may see a larger difference. We ran all of our evals on 32 gpus for fair comparison between models, and are also fixing this duplication issue in composer (https://github.com/mosaicml/composer/pull/2218). cc @bmosaicml
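To make the padding concrete, here is a minimal sketch of the idea (this is not the actual composer code, and the 273-sample count assumes the standard WSC273 Winograd set):

import math

def padded_indices(dataset_size: int, world_size: int) -> list[int]:
    """Indices after padding the dataset to a multiple of world_size."""
    per_rank = math.ceil(dataset_size / world_size)
    padded_size = per_rank * world_size
    # Wrap around: the first (padded_size - dataset_size) samples get scored twice.
    return [i % dataset_size for i in range(padded_size)]

indices = padded_indices(273, 32)   # WSC273 on 32 GPUs
print(len(indices))                 # 288
print(len(indices) - 273)           # 15 duplicated samples, ~5% of the eval set

On a dataset this small, scoring ~5% of the samples twice is enough to visibly move the reported accuracy.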
Thanks for the clarification. Is this the reason for the difference between running 'python eval/eval.py' and 'composer eval/eval.py'? That is, python eval/eval.py uses the exact samples in the dataset, while composer eval/eval.py duplicates samples to pad the dataset size?
@congyingxia Yes, that is my understanding. In the current v0.1.1 codebase, prior to the fixes coming in https://github.com/mosaicml/composer/pull/2218, using multiple GPUs for eval with any dataset (streaming, HF, or a standard map-style dataset) will duplicate 0 <= K < WORLD_SIZE samples to maintain the same number of samples per GPU.
Got it, thanks!