gpt-pilot Evaluate against SWE-bench benchmark

Open kripper opened this issue 3 months ago • 3 comments

Version

Command-line (Python) version

Suggestion

Evaluate against SWE-bench Benchmark: https://github.com/princeton-nlp/SWE-bench

Mar 22 '24 04:03 kripper

Loooooooool

Mar 22 '24 05:03 luyandadhlamini2

Loooooooool

no?

Mar 22 '24 05:03 kripper

@kripper this is a good suggestion and we've looked into it.

SWE-bench is geared toward assistants that work on a small part of a bigger project. We're working from a different starting point - creating full-featured projects from scratch. Creating a full project involves many other difficult challenges (e.g. software architecture, refactoring, etc.) that SWE-bench doesn't cover (fully, or at all).

As a consequence, currently we don't support the workflow that SWE-bench assumes.

Being able to take over an existing project is something we're currently working on, so in the future we'll also be able to support that use case and, as a result, be able to compare using SWE-bench.

Mar 22 '24 05:03 senko
