
Automatic benchmarking of gpt-engineer with APPS

Open ATheorell opened this issue 1 year ago • 2 comments

Feature description

gpt-engineer has an automatic evals suite in "evals/eval_new_code.py". However, only 2 test cases are given in evals/new_code_eval.yaml. As an alternative to filling in more test cases manually, we should parse prompts and tests from the (very large) APPS dataset (https://paperswithcode.com/dataset/apps).

Since APPS is far too large to run in its entirety, there should be functionality to run n randomly selected tests, as well as to run n tests according to some predetermined ordering (so that consecutive benchmark runs are comparable).

The APPS dataset should not be added to the gpt-engineer git repo! Probably the best way to handle this is to pull it from Hugging Face (https://huggingface.co/datasets/codeparrot/apps) in the code itself (potentially caching it and gitignoring the cache so it doesn't need to be pulled on every run).
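A minimal sketch of what this could look like, assuming the Hugging Face `datasets` library; the split name, sample size, and `select_problems` helper are illustrative assumptions, not part of the existing eval suite:

```python
import random

from datasets import load_dataset  # pip install datasets

# load_dataset caches under ~/.cache/huggingface by default, so APPS is only
# downloaded once and nothing needs to be committed to the gpt-engineer repo.
apps = load_dataset("codeparrot/apps", split="test")


def select_problems(dataset, n, seed=None):
    """Return n problems: shuffled with a fixed seed for reproducible runs,
    or in a fresh random order when seed is None."""
    indices = list(range(len(dataset)))
    rng = random.Random(seed)  # seed=None -> different order every call
    rng.shuffle(indices)
    return dataset.select(indices[:n])


# Deterministic subset: consecutive benchmark runs see the same 50 problems.
benchmark_problems = select_problems(apps, n=50, seed=42)
```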

Motivation/Application

Automatic benchmarking is the ideal way to determine whether a proposed change to the code base is advantageous.

ATheorell avatar Oct 23 '23 09:10 ATheorell

I can add a sampled version of the APPS dataset, which will give us a good idea of how our project is doing without costing a fortune. APPS is great for testing how well our repair of broken code works.

pbharrin avatar Oct 23 '23 18:10 pbharrin

@ATheorell assign to me

azrv avatar Feb 01 '24 20:02 azrv

Yes, please take a shot at this @azrv :)

ATheorell avatar Feb 02 '24 18:02 ATheorell

https://github.com/gpt-engineer-org/gpt-engineer/pull/1051 was merged 🎉

@pbharrin It's now a matter of cherry-picking problems we want to constantly test against.

gpt_engineer/benchmark/benchmarks/apps/problems.py:4
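For illustration, a hand-curated list could look something like this; the IDs and structure are hypothetical and not the actual contents of problems.py:

```python
# Hypothetical sketch of a curated problem list; the real
# gpt_engineer/benchmark/benchmarks/apps/problems.py may differ.
# The idea: a fixed set of APPS problem IDs that every benchmark run uses,
# so results stay comparable across runs.
PROBLEM_IDS = [
    "0001",
    "0042",
    "1337",
]
```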

azrv avatar Mar 23 '24 12:03 azrv