LeanDojo
Buggy Benchmark Generation
Description
I believe scripts/generate-benchmark-lean4.ipynb is buggy. I evaluated the ReProver premise retriever on a dataset I generated from Mathlib4 with this notebook. According to the paper, I should get a recall of around 34.7. However, when I generated the dataset at commit 29dcec074de168ac2bf835a77ef68bbe069194c5, my recall was absurdly high, around 70, and this was consistent across the other commits I tried with this generation code. In contrast, the downloaded dataset v9 at https://zenodo.org/records/10929138 gave a recall of about 36, as expected. I am not sure which commit v9 was generated from, but regardless, this recall disparity suggests the generation code is buggy.
Detailed Steps to Reproduce the Behavior
- Run scripts/generate-benchmark-lean4.ipynb on commit 29dcec074de168ac2bf835a77ef68bbe069194c5 to reproduce LeanDojo Benchmark 4 version v9 (see the sketch after this list).
- Clone ReProver and follow its instructions to evaluate the premise retriever checkpoint on that dataset.
- Evaluate the same checkpoint on the dataset downloaded from https://zenodo.org/records/10929138.
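For context, this is roughly how I pinned Mathlib4 to the commit above before running the notebook. This is a minimal sketch, not the notebook's code: the notebook does the actual tracing and benchmark extraction, and the destination directory name here is just illustrative.
# Sketch only: pin Mathlib4 to the commit used above and trace it with LeanDojo.
# The generate-benchmark-lean4.ipynb notebook performs an equivalent step itself;
# the dst_dir value is illustrative, not something the notebook requires.
from lean_dojo import LeanGitRepo, trace

MATHLIB4_URL = "https://github.com/leanprover-community/mathlib4"
COMMIT = "29dcec074de168ac2bf835a77ef68bbe069194c5"  # commit I regenerated the benchmark from

repo = LeanGitRepo(MATHLIB4_URL, COMMIT)
traced_repo = trace(repo, dst_dir="traced_mathlib4")  # reuses the cache if already traced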
Logs
Downloaded dataset:
Average R@1 = 13.057546943693088 %, R@10 = 35.968246863464174 %, MRR = 0.31899176730322715
Generated dataset:
Average R@1 = 28.444059618579143 %, R@10 = 69.65759521602048 %, MRR = 0.5975602059714626
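For reference, here is a minimal sketch of how I understand R@k and MRR to be computed in this setting. It assumes, for each test example, a ranked list of retrieved premise names and the set of ground-truth premises; ReProver's actual evaluation code may differ in details.
# Sketch of Recall@k and MRR for premise retrieval (not taken from ReProver's code).
# `predictions` is a list of (retrieved_premises, ground_truth_premises) pairs,
# where retrieved_premises is ranked best-first.
def recall_at_k(predictions, k):
    scores = []
    for retrieved, ground_truth in predictions:
        if not ground_truth:
            continue
        hits = len(set(retrieved[:k]) & set(ground_truth))
        scores.append(hits / len(ground_truth))
    return 100 * sum(scores) / len(scores)  # reported as a percentage

def mean_reciprocal_rank(predictions):
    rr_scores = []
    for retrieved, ground_truth in predictions:
        gt = set(ground_truth)
        rank = next((i + 1 for i, p in enumerate(retrieved) if p in gt), None)
        rr_scores.append(1.0 / rank if rank is not None else 0.0)
    return sum(rr_scores) / len(rr_scores)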
There are also some differences in size between the two datasets. With the following code,
import json
from pathlib import Path

downloaded_dataset_path = 'leandojo_benchmark_4_downloaded/random'
new_commit_dataset_path = 'leandojo_benchmark_4_new_commit/random'
downloaded_corpus_path = 'leandojo_benchmark_4_downloaded/corpus.jsonl'
new_commit_corpus_path = 'leandojo_benchmark_4_new_commit/corpus.jsonl'

def load_jsonl(file_path):
    # Load a JSON Lines file (one JSON object per line), e.g. corpus.jsonl.
    with open(file_path, 'r') as file:
        print(f"Loading {file_path}")
        return [json.loads(line) for line in file]

def load_json(file_path):
    # Load the train/val/test splits from a benchmark split directory.
    datasets = {}
    for split in ["train", "val", "test"]:
        new_file_path = f"{file_path}/{split}.json"
        with open(new_file_path, 'r') as file:
            print(f"Loading {new_file_path}")
            datasets[split] = json.load(file)
    return datasets

downloaded_dataset = load_json(downloaded_dataset_path)
downloaded_dataset_train, downloaded_dataset_val, downloaded_dataset_test = (
    downloaded_dataset["train"],
    downloaded_dataset["val"],
    downloaded_dataset["test"],
)
new_commit_dataset = load_json(new_commit_dataset_path)
new_commit_dataset_train, new_commit_dataset_val, new_commit_dataset_test = (
    new_commit_dataset["train"],
    new_commit_dataset["val"],
    new_commit_dataset["test"],
)

downloaded_corpus = load_jsonl(downloaded_corpus_path)
new_commit_corpus = load_jsonl(new_commit_corpus_path)

# Compare split sizes, corpus sizes, and the premise count of the first corpus file.
analysis_results = {
    "downloaded_dataset_train_size": len(downloaded_dataset_train),
    "downloaded_dataset_val_size": len(downloaded_dataset_val),
    "downloaded_dataset_test_size": len(downloaded_dataset_test),
    "new_commit_dataset_train_size": len(new_commit_dataset_train),
    "new_commit_dataset_val_size": len(new_commit_dataset_val),
    "new_commit_dataset_test_size": len(new_commit_dataset_test),
    "downloaded_corpus_size": len(downloaded_corpus),
    "new_commit_corpus_size": len(new_commit_corpus),
    "downloaded_corpus_file_premises_size": len(downloaded_corpus[0]['premises']),
    "new_commit_corpus_file_premises_size": len(new_commit_corpus[0]['premises']),
}
print(analysis_results)
I get the following final output:
{
'downloaded_dataset_train_size': 112729,
'downloaded_dataset_val_size': 2000,
'downloaded_dataset_test_size': 2000,
'new_commit_dataset_train_size': 118517,
'new_commit_dataset_val_size': 2000,
'new_commit_dataset_test_size': 2000,
'downloaded_corpus_size': 5192,
'new_commit_corpus_size': 5674,
'downloaded_corpus_file_premises_size': 465,
'new_commit_corpus_file_premises_size': 478
}
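To narrow down where the extra files and premises come from, a diff of the two corpora along these lines might help. This is a rough sketch continuing from the script above; it assumes each corpus.jsonl entry has a 'path' field, which is what I see in both files.
# Rough sketch: compare which file paths are covered by each corpus, to see which
# modules only appear in the regenerated corpus (field name 'path' assumed from the
# corpus.jsonl files above).
downloaded_paths = {entry["path"] for entry in downloaded_corpus}
new_commit_paths = {entry["path"] for entry in new_commit_corpus}

only_in_new = sorted(new_commit_paths - downloaded_paths)
only_in_downloaded = sorted(downloaded_paths - new_commit_paths)

print(f"Files only in the regenerated corpus: {len(only_in_new)}")
print(f"Files only in the downloaded corpus: {len(only_in_downloaded)}")
print(only_in_new[:20])  # first few, to eyeball which modules differ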
Platform Information
- OS: Ubuntu
- LeanDojo Version: 1.8.4