
Could not reproduce the evaluation results

Open guanqun-yang opened this issue 2 years ago • 11 comments

Hi,

I am trying to reproduce your reported numbers using the command provided by LM Evaluation Harness. One of the commands looks like the following:

python main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_3b,use_accelerate=True,dtype=half \
--tasks  arc_challenge \
--batch_size 8 \
--num_fewshots 25 \
--write_out

which gave me a directory of .json files that looks like the following:

├── arc_challenge_write_out_info.json
├── hellaswag_write_out_info.json
└── truthfulqa_mc_write_out_info.json
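
For reference, a quick way to inspect one of those write-out files and see which per-example columns are available (a sketch; the file name is taken from the listing above, and the same columns are what the aggregation script below averages):

import pandas as pd

# Each write-out file is a list of per-example records; metric columns such as
# acc / acc_norm are the ones averaged later.
df = pd.read_json("arc_challenge_write_out_info.json")
print(df.columns.tolist())
print(df.head())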

I tried to compute the final results using the script below but found that the numbers I obtained were quite different from what you reported. I don't know which part went wrong.

| model | arc_challenge | hellaswag | truthfulqa_mc |
|---|---:|---:|---:|
| openlm-research/open_llama_3b | 0.260239 | 0.25941 | 0.487843 |
| openlm-research/open_llama_7b | 0.261092 | 0.262298 | 0.483711 |

Here is the script I used to create the table:

import pandas as pd
from tqdm import tqdm

# Map each task to the metric and few-shot setting it is scored with.
task_dict = \
{'arc_challenge': {'metric': 'acc_norm',
                   'shot': 25,
                   'task_name': 'arc_challenge'},
 'hellaswag': {'metric': 'acc_norm', 'shot': 10, 'task_name': 'hellaswag'},
 'truthfulqa_mc': {'metric': 'mc2', 'shot': 0, 'task_name': 'truthfulqa_mc'}}

models = [
    "openlm-research/open_llama_3b",
    "openlm-research/open_llama_7b",
]

records = list()
for model in tqdm(models):
    for task, d in task_dict.items():
        task_name = d["task_name"]
        metric_name = d["metric"]

        # Average the per-example metric column from the harness write-out file
        # collected under results/<model>/.
        df = pd.read_json(f"results/{model}/{task_name}.json")
        records.append(
            {
                "model": model,
                "task": task,
                "metric_name": metric_name,
                "metric": df[metric_name].mean(),
            }
        )

stat_df = pd.DataFrame(records)
stat_df = pd.pivot_table(stat_df, index="model", columns="task", values="metric")

print(stat_df.to_markdown())

guanqun-yang avatar Jun 11 '23 02:06 guanqun-yang

We believe that this is related to the HF fast tokenizer problem mentioned in the readme here. You'll need to avoid using the auto-converted fast tokenizer to get correct tokenization.
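
A quick way to check whether the fast tokenizer is the culprit (a sketch assuming the standard transformers API, not a command from this thread):

from transformers import AutoTokenizer, LlamaTokenizer

model_path = "openlm-research/open_llama_3b"

# Slow sentencepiece-based tokenizer, as recommended in the OpenLLaMA readme.
slow = LlamaTokenizer.from_pretrained(model_path)
# Default AutoTokenizer, which may hand back the auto-converted fast tokenizer.
fast = AutoTokenizer.from_pretrained(model_path)

text = "Q: What is the largest animal?\nA:"
# If these two ID sequences differ, the harness run was using the broken tokenization.
print(slow.encode(text))
print(fast.encode(text))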

That being said, our evaluation was run in JAX rather than PyTorch. You can follow the evaluation doc of our framework to reproduce our evaluation.

young-geng avatar Jun 12 '23 17:06 young-geng

Thank you for your prompt response @young-geng! However, correcting that mistake to the expected use_fast=False and rerunning the entire evaluation gave me the same near-random results (around 25%), still quite different from what you reported.

I am unsure whether your process of converting from the JAX checkpoint to the PyTorch format is airtight.

guanqun-yang avatar Jun 13 '23 02:06 guanqun-yang

@guanqun-yang @young-geng did you use zero few-shot? I used the default of 0 few-shot examples, and my results are almost the same.

chi2liu avatar Jun 13 '23 06:06 chi2liu

@guanqun-yang I just ran 25-shot arc_challenge with use_fast=False, and here's my result:

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 0.4369 | ± | 0.0145 |
| | | acc_norm | 0.4735 | ± | 0.0146 |

Here's my result for 10-shot hellaswag:

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| hellaswag | 0 | acc | 0.5358 | ± | 0.0050 |
| | | acc_norm | 0.7205 | ± | 0.0045 |

These results do match the evaluation we did in JAX. I've also numerically compared the JAX model with the PyTorch model, and the logits match pretty well (around 1e-8 error on CPU; the error is higher on GPU, depending on the precision used).
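
For anyone who wants to run a similar sanity check without the JAX side, a rough sketch of the same idea is to load the PyTorch checkpoint at two precisions and compare the logits (this is an illustration, not the exact comparison described above):

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

model_path = "openlm-research/open_llama_3b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = LlamaTokenizer.from_pretrained(model_path)
inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt").to(device)

# Load the same checkpoint at two precisions and compare their outputs.
m32 = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32).to(device)
m16 = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)

with torch.no_grad():
    logits32 = m32(**inputs).logits
    logits16 = m16(**inputs).logits.float()

# A large discrepancy here points at a precision or conversion problem
# rather than a tokenization or few-shot issue.
print((logits32 - logits16).abs().max().item())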

young-geng avatar Jun 13 '23 09:06 young-geng

@chi2liu I am trying to reproduce the numbers on the Open LLM Leaderboard, which specifies the number of shots for each task.

guanqun-yang avatar Jun 13 '23 15:06 guanqun-yang

@young-geng Thank you for reproducing the results! Did you try to obtain the write-out .json files? I computed the metrics based on those files.

guanqun-yang avatar Jun 13 '23 15:06 guanqun-yang

@guanqun-yang I did not save those json files, but I did use the same 25 shots for arc_challenge and 10 shots for hellaswag, matching the Open LLM Leaderboard. I just realized that you are evaluating the 3b model while I was evaluating the 7b model; let me also try the 3b model.

young-geng avatar Jun 13 '23 16:06 young-geng

Here are the results for the 3b model:

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 0.3686 | ± | 0.0141 |
| | | acc_norm | 0.4096 | ± | 0.0144 |

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| hellaswag | 0 | acc | 0.4956 | ± | 0.0050 |
| | | acc_norm | 0.6681 | ± | 0.0047 |

young-geng avatar Jun 14 '23 10:06 young-geng

@young-geng Thank you for reproducing the evaluations! It could be something subtle causing the issue. Let me double-check and report back here. Also, are you using the same command as I did, or something different?

guanqun-yang avatar Jun 14 '23 20:06 guanqun-yang

@young-geng It seems that I have located the issue. I am able to reproduce the reported number using the command:

python ../lm-evaluation-harness/main.py \
--model hf-causal-experimental \
--model_args pretrained=openlm-research/open_llama_7b,use_accelerate=True,dtype=half \
--batch_size 16 \
--tasks arc_challenge \
--num_fewshot 25 \
--write_out \
--output_base_path <path>

Here is what I believe caused the difference:

  • What I did was first download the model to a custom directory and then load it from there.
  • You were letting transformers handle the download and loading the model from $HF_HOME.

though this difference seems unlikely to explain the gap; the two loading paths are sketched below.
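
For completeness, the two loading paths look roughly like this (a sketch assuming the huggingface_hub and transformers APIs; not the exact commands either of us ran):

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

# Path 1: download the checkpoint to a local directory first, then load from that path.
local_dir = snapshot_download("openlm-research/open_llama_7b")
model_from_disk = AutoModelForCausalLM.from_pretrained(local_dir)

# Path 2: let transformers download and cache the checkpoint under $HF_HOME.
model_from_hub = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b")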

guanqun-yang avatar Jun 15 '23 20:06 guanqun-yang

:detective: @guanqun-yang -- Looks like the original command you posted has --num_fewshots while the one above has --num_fewshot.

currents-abhishek avatar Jul 10 '23 11:07 currents-abhishek