Automatic benchmarking of gpt-engineer with swe-bench

Open AntonOsika opened this issue 1 year ago • 6 comments

Feature description

We have a way to easily add benchmarks:

https://www.loom.com/share/206805143fbb4302b5455a5329eaab17?sid=f689608f-8e49-44f7-b55f-4c81e9dc93e6

This issue is about investigating whether swe-bench is a good benchmark to add and, if it is, adding a simple version of it.

AntonOsika avatar Dec 18 '23 14:12 AntonOsika

Tempted to prioritize this higher after the Devin announcement (just as @batwood001 did in #1062).

ErikBjare avatar Mar 13 '24 10:03 ErikBjare

Makes sense. Let's figure this out, along with people's availability, at our tech planning meeting this Thursday.

viborc avatar Mar 13 '24 10:03 viborc

@viborc can you assign this to me?

Mohit-Dhawan98 avatar Mar 28 '24 18:03 Mohit-Dhawan98

Done!

viborc avatar Mar 28 '24 18:03 viborc

This is more of a general update to the community than anything else. Work on this issue is ongoing: @Mohit-Dhawan98 is working on it with @ATheorell's support. We'll likely have SWE-bench support in the near future!

viborc avatar May 04 '24 09:05 viborc

Someone from OpenDevin suggested we look into their work here, learn from it, and possibly reuse it if needed. Putting this here for our reference: https://github.com/OpenDevin/OpenDevin/tree/main/evaluation/swe_bench

viborc avatar Jul 18 '24 17:07 viborc