LeanDojo
Buggy Benchmark Generation
Description
I believe scripts/generate-benchmark-lean4.ipynb is buggy. I evaluated the ReProver premise retriever on a dataset I generated from Mathlib4 with this notebook. According to the paper, I should get a recall of around 34.7. However, when I generated the dataset at commit 29dcec074de168ac2bf835a77ef68bbe069194c5, my recall was absurdly high, around 70, and this was consistent across the other commits I tried with this generation code. In contrast, the downloaded dataset v9 at https://zenodo.org/records/10929138 gave a recall of about 36, as expected. I am not sure which commit v9 was generated from, but regardless, this recall disparity suggests the generation code is buggy.
Detailed Steps to Reproduce the Behavior
- Run scripts/generate-benchmark-lean4.ipynb on commit 29dcec074de168ac2bf835a77ef68bbe069194c5 to reproduce LeanDojo Benchmark 4 version v9 (see the sketch after this list).
- Clone ReProver and follow its instructions to evaluate the premise retriever checkpoint on that dataset.
- Evaluate the same checkpoint on the dataset downloaded from https://zenodo.org/records/10929138.
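For context, this is roughly how I pinned Mathlib4 to the commit above before running the notebook. This is a minimal sketch, not the notebook's code: the notebook does the actual tracing and benchmark extraction, and the destination directory name here is just illustrative.
# Sketch only: pin Mathlib4 to the commit used above and trace it with LeanDojo.
# The generate-benchmark-lean4.ipynb notebook performs an equivalent step itself;
# the dst_dir value is illustrative, not something the notebook requires.
from lean_dojo import LeanGitRepo, trace

MATHLIB4_URL = "https://github.com/leanprover-community/mathlib4"
COMMIT = "29dcec074de168ac2bf835a77ef68bbe069194c5"  # commit I regenerated the benchmark from

repo = LeanGitRepo(MATHLIB4_URL, COMMIT)
traced_repo = trace(repo, dst_dir="traced_mathlib4")  # reuses the cache if already traced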
Logs
Downloaded dataset:
Average R@1 = 13.057546943693088 %, R@10 = 35.968246863464174 %, MRR = 0.31899176730322715
Generated dataset:
Average R@1 = 28.444059618579143 %, R@10 = 69.65759521602048 %, MRR = 0.5975602059714626
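For reference, here is a minimal sketch of how I understand R@k and MRR to be computed in this setting. It assumes, for each test example, a ranked list of retrieved premise names and the set of ground-truth premises; ReProver's actual evaluation code may differ in details.
# Sketch of Recall@k and MRR for premise retrieval (not taken from ReProver's code).
# `predictions` is a list of (retrieved_premises, ground_truth_premises) pairs,
# where retrieved_premises is ranked best-first.
def recall_at_k(predictions, k):
    scores = []
    for retrieved, ground_truth in predictions:
        if not ground_truth:
            continue
        hits = len(set(retrieved[:k]) & set(ground_truth))
        scores.append(hits / len(ground_truth))
    return 100 * sum(scores) / len(scores)  # reported as a percentage

def mean_reciprocal_rank(predictions):
    rr_scores = []
    for retrieved, ground_truth in predictions:
        gt = set(ground_truth)
        rank = next((i + 1 for i, p in enumerate(retrieved) if p in gt), None)
        rr_scores.append(1.0 / rank if rank is not None else 0.0)
    return sum(rr_scores) / len(rr_scores)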
There are also some differences in size between the two datasets. With the following code,
import json
from pathlib import Path

downloaded_dataset_path = 'leandojo_benchmark_4_downloaded/random'
new_commit_dataset_path = 'leandojo_benchmark_4_new_commit/random'
downloaded_corpus_path = 'leandojo_benchmark_4_downloaded/corpus.jsonl'
new_commit_corpus_path = 'leandojo_benchmark_4_new_commit/corpus.jsonl'

def load_jsonl(file_path):
    # Load a JSON Lines file (one JSON object per line), e.g. corpus.jsonl.
    with open(file_path, 'r') as file:
        print(f"Loading {file_path}")
        return [json.loads(line) for line in file]

def load_json(file_path):
    # Load the train/val/test splits from a benchmark split directory.
    datasets = {}
    for split in ["train", "val", "test"]:
        new_file_path = f"{file_path}/{split}.json"
        with open(new_file_path, 'r') as file:
            print(f"Loading {new_file_path}")
            datasets[split] = json.load(file)
    return datasets

downloaded_dataset = load_json(downloaded_dataset_path)
downloaded_dataset_train, downloaded_dataset_val, downloaded_dataset_test = (
    downloaded_dataset["train"],
    downloaded_dataset["val"],
    downloaded_dataset["test"],
)
new_commit_dataset = load_json(new_commit_dataset_path)
new_commit_dataset_train, new_commit_dataset_val, new_commit_dataset_test = (
    new_commit_dataset["train"],
    new_commit_dataset["val"],
    new_commit_dataset["test"],
)

downloaded_corpus = load_jsonl(downloaded_corpus_path)
new_commit_corpus = load_jsonl(new_commit_corpus_path)

# Compare split sizes, corpus sizes, and the premise count of the first corpus file.
analysis_results = {
    "downloaded_dataset_train_size": len(downloaded_dataset_train),
    "downloaded_dataset_val_size": len(downloaded_dataset_val),
    "downloaded_dataset_test_size": len(downloaded_dataset_test),
    "new_commit_dataset_train_size": len(new_commit_dataset_train),
    "new_commit_dataset_val_size": len(new_commit_dataset_val),
    "new_commit_dataset_test_size": len(new_commit_dataset_test),
    "downloaded_corpus_size": len(downloaded_corpus),
    "new_commit_corpus_size": len(new_commit_corpus),
    "downloaded_corpus_file_premises_size": len(downloaded_corpus[0]['premises']),
    "new_commit_corpus_file_premises_size": len(new_commit_corpus[0]['premises']),
}
print(analysis_results)
I get the following final output:
{
'downloaded_dataset_train_size': 112729,
'downloaded_dataset_val_size': 2000,
'downloaded_dataset_test_size': 2000,
'new_commit_dataset_train_size': 118517,
'new_commit_dataset_val_size': 2000,
'new_commit_dataset_test_size': 2000,
'downloaded_corpus_size': 5192,
'new_commit_corpus_size': 5674,
'downloaded_corpus_file_premises_size': 465,
'new_commit_corpus_file_premises_size': 478
}
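To narrow down where the extra files and premises come from, a diff of the two corpora along these lines might help. This is a rough sketch continuing from the script above; it assumes each corpus.jsonl entry has a 'path' field, which is what I see in both files.
# Rough sketch: compare which file paths are covered by each corpus, to see which
# modules only appear in the regenerated corpus (field name 'path' assumed from the
# corpus.jsonl files above).
downloaded_paths = {entry["path"] for entry in downloaded_corpus}
new_commit_paths = {entry["path"] for entry in new_commit_corpus}

only_in_new = sorted(new_commit_paths - downloaded_paths)
only_in_downloaded = sorted(downloaded_paths - new_commit_paths)

print(f"Files only in the regenerated corpus: {len(only_in_new)}")
print(f"Files only in the downloaded corpus: {len(only_in_downloaded)}")
print(only_in_new[:20])  # first few, to eyeball which modules differ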
Platform Information
- OS: Ubuntu
- LeanDojo Version: 1.8.4