gpt-pilot
Evaluate against SWE-bench benchmark
Version
Command-line (Python) version
Suggestion
Evaluate against SWE-bench Benchmark: https://github.com/princeton-nlp/SWE-bench
Loooooooool
no?
@kripper this is a good suggestion and we've looked into it.
SWE-bench is geared toward assistants that work on a small part of a bigger project. We're working from a different starting point: creating full-featured projects from scratch. Creating a full project involves many other difficult challenges (e.g. software architecture, refactoring) that SWE-bench covers only partially or not at all.
As a consequence, currently we don't support the workflow that SWE-bench assumes.
Being able to take over an existing project is something we're currently working on, so in the future we'll also be able to support that use case and, as a result, be able to compare using SWE-bench.