gpt-engineer
Automatic benchmarking of gpt-engineer with swe-bench
Feature description
We have a way to easily add benchmarks:
https://www.loom.com/share/206805143fbb4302b5455a5329eaab17?sid=f689608f-8e49-44f7-b55f-4c81e9dc93e6
This issue is about investigating whether SWE-bench is a good benchmark to add and, if so, adding a simple version of it.
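As a rough illustration of what "a simple version" could involve, here is a minimal sketch of the scoring side only. It assumes the usual SWE-bench convention that an instance counts as resolved when all of its fail-to-pass and pass-to-pass tests pass after the generated patch is applied. The class and function names (`SweBenchInstance`, `is_resolved`, `resolved_rate`) are hypothetical and are not part of gpt-engineer or the SWE-bench tooling; how the per-test results are produced is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class SweBenchInstance:
    # Hypothetical container; the fields are modeled on the public SWE-bench
    # dataset columns (instance_id, problem_statement, FAIL_TO_PASS, PASS_TO_PASS).
    instance_id: str
    problem_statement: str
    fail_to_pass: list[str] = field(default_factory=list)
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(instance: SweBenchInstance, test_results: dict[str, bool]) -> bool:
    """Return True only if every listed test passed.

    A test missing from `test_results` is treated as a failure.
    Note: an instance that lists no tests at all is trivially resolved.
    """
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolved_rate(
    instances: list[SweBenchInstance],
    results_by_id: dict[str, dict[str, bool]],
) -> float:
    """Fraction of instances resolved; 0.0 for an empty list."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst, results_by_id.get(inst.instance_id, {}))
        for inst in instances
    )
    return resolved / len(instances)
```

The actual work (checking out each repository at its base commit, running gpt-engineer on the problem statement, applying the patch, and running the tests) would sit in front of this and is the part that needs investigation.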
Tempted to prioritize this higher after the Devin announcement (just as @batwood001 in #1062).
Makes sense. Let's figure it out this Thursday at our tech planning meeting, depending on people's availability.
@viborc can you assign this to me?
Done!
This is more of a general update to the community than anything else. The work on this issue is ongoing, and @Mohit-Dhawan98 is working on it with @ATheorell's support. We'll likely have SWE-bench support in the near future!
Someone from the OpenDevin team suggested we look into their work here, learn from it, and possibly reuse it if needed. Putting this here for our reference: https://github.com/OpenDevin/OpenDevin/tree/main/evaluation/swe_bench