gpt-engineer
Automatic benchmarking of gpt-engineer with APPS
Feature description
gpt-engineer has an automatic evals suite in "evals/eval_new_code.py". However, only 2 test cases are given in evals/new_code_eval.yaml. As an alternative to filling in more test cases manually, we should parse prompts and tests from the (very large) APPS dataset (https://paperswithcode.com/dataset/apps).
Since APPS is far too large to run in its entirety, there should be functionality to run n randomly selected tests, as well as to run n tests according to some predetermined ordering (so that consecutive benchmark runs are comparable).
The APPS dataset should not be added to the gpt-engineer git repo! Probably the best way to handle this is to pull it from Hugging Face (https://huggingface.co/datasets/codeparrot/apps) in the code itself (potentially caching it and gitignoring the cache so it doesn't need to be pulled on every run).
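For illustration, a minimal sketch of how this could look using the Hugging Face `datasets` library. The dataset name codeparrot/apps and the `problem_id`/`input_output` fields follow the dataset card; the cache directory, helper names, and seed are placeholder assumptions, not part of any existing gpt-engineer code:

```python
import json
import random

from datasets import load_dataset


def load_apps_problems(cache_dir: str = ".apps_cache"):
    # Downloads APPS on the first run and reuses the local cache afterwards;
    # the cache directory can be gitignored. Depending on the `datasets`
    # version, trust_remote_code=True may also be needed for this dataset.
    return load_dataset("codeparrot/apps", split="test", cache_dir=cache_dir)


def select_problems(dataset, n: int, seed: int | None = None):
    # seed=None keeps the dataset's own ordering (comparable across runs);
    # a fixed seed gives a random but reproducible selection of n problems.
    indices = list(range(len(dataset)))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    return dataset.select(indices[:n])


if __name__ == "__main__":
    for problem in select_problems(load_apps_problems(), n=5, seed=42):
        tests = json.loads(problem["input_output"]) if problem["input_output"] else {}
        print(problem["problem_id"], problem["difficulty"], len(tests.get("inputs", [])))
```

Running the same seed (or no seed at all) on consecutive benchmark runs would keep the selected problems identical, which is what makes before/after comparisons meaningful.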
Motivation/Application
Automatic benchmarking is the ideal way to determine whether a proposed change to the code base is advantageous.
I can add a sampled version of the APPS dataset, which will give us a good idea of how our project is doing without costing a fortune. APPS is great at testing how well our repair of broken code works.
@ATheorell assign to me
Yes, please have a shot at this @azrv :)
https://github.com/gpt-engineer-org/gpt-engineer/pull/1051 was merged 🎉
@pbharrin It's now a matter of cherry-picking the problems we want to consistently test against.