feat: started working on SWE-bench evals
Implemented with gptme, using moatless-tools and aider as reference implementations.
- [x] Set up harness
- [ ] Get a single eval instance passing
  - [ ] Gets stuck at installing deps for repos
    - moatless-tools doesn't support running tests?
    - aider depends on the Docker env?
- [ ] Try making our own eval instance?
> [!IMPORTANT]
> Introduces SWE-bench evaluation framework in `gptme`, with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

- New Features:
  - Introduces SWE-bench evaluation framework in `gptme/eval/swebench`.
  - Implements `run_swebench_evaluation()` in `evaluate.py` to evaluate instances using an `Agent`.
  - Adds CLI command in `main.py` for running evaluations with options for model, dataset, split, instance, and verbosity.
- Utilities:
  - `utils.py` provides functions for loading instances, setting up repositories, and extracting file spans from patches.
- Configuration:
  - Adds `gptme-eval-swebench` script entry in `pyproject.toml`.
  - Adds `datasets` and `fsspec` as dependencies in `pyproject.toml`.

This description was created automatically for 4e9b48a07eaa8ba54c35e3a23156aac2ae4656aa. It will automatically update as commits are pushed.
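For context on what those pieces do, here is a rough, standalone sketch of the instance-loading, repository-setup, and patch-span steps. It only relies on the `datasets` dependency added in this PR; the function names and the SWE-bench field names (`instance_id`, `repo`, `base_commit`, `patch`) are assumptions for illustration, not the actual `gptme/eval/swebench` API.

```python
# Hypothetical sketch, not the actual gptme/eval/swebench code.
# Field names are assumed from the public SWE-bench datasets.
import re
import subprocess
from pathlib import Path

from datasets import load_dataset


def load_instances(dataset: str = "princeton-nlp/SWE-bench_Lite", split: str = "test") -> list[dict]:
    """Load SWE-bench instances as a list of dicts."""
    return list(load_dataset(dataset, split=split))


def setup_repository(instance: dict, workdir: Path) -> Path:
    """Clone the instance's repo and check out its base commit."""
    repo_dir = workdir / instance["instance_id"]
    if not repo_dir.exists():
        subprocess.run(
            ["git", "clone", f"https://github.com/{instance['repo']}", str(repo_dir)],
            check=True,
        )
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=repo_dir, check=True)
    return repo_dir


def extract_file_spans(patch: str) -> dict[str, list[tuple[int, int]]]:
    """Map each file touched by a unified diff to its (start_line, length) hunk spans."""
    spans: dict[str, list[tuple[int, int]]] = {}
    current = None
    for line in patch.splitlines():
        if line.startswith("+++ b/"):
            current = line[len("+++ b/"):]
            spans.setdefault(current, [])
        elif line.startswith("@@") and current:
            m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@", line)
            if m:
                spans[current].append((int(m.group(1)), int(m.group(2) or 1)))
    return spans


if __name__ == "__main__":
    instances = load_instances()
    first = instances[0]
    print(first["instance_id"], list(extract_file_spans(first["patch"])))
```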
Codecov Report
Attention: Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.
Project coverage is 77.15%. Comparing base (81708e6) to head (4e9b48a).
:white_check_mark: All tests successful. No failed tests found.
Additional details and impacted files
| Coverage Diff | master | #142 | +/- |
|---|---|---|---|
| Coverage | 80.63% | 77.15% | -3.49% |
| Files | 52 | 57 | +5 |
| Lines | 3145 | 3287 | +142 |
| Hits | 2536 | 2536 | |
| Misses | 609 | 751 | +142 |
| Flag | Coverage Δ | |
|---|---|---|
| anthropic/claude-3-haiku-20240307 | 76.11% <0.00%> (-3.44%) | :arrow_down: |
| openai/gpt-4o-mini | 75.84% <0.00%> (-3.43%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
Anthropic announced that Claude 3.5 (new), aka Claude "3.6", scores 49% on SWE-Bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet
I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.
Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.
I got it kinda working with swe-agent and this dataset, which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra (loading sketch below)
Might also integrate https://swe-rex.com/latest/, which seems pretty useful.
My branch is a giant mess atm though 😭
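For anyone who wants to point the harness at SWE-bench-extra in the meantime, loading it should just be a matter of swapping the dataset name passed to `datasets`; the `train` split name here is an assumption, so check the dataset card:

```python
from datasets import load_dataset

# Assumed split name; verify on the Hub dataset card for nebius/SWE-bench-extra.
extra = load_dataset("nebius/SWE-bench-extra", split="train")
print(len(extra), extra[0]["instance_id"])
```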
@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes :pray:
I also just found SWE-Gym: https://arxiv.org/abs/2412.21139
I'll try to get back on this soon! At the very least I'll just share a gist which shows how to get the deps to install properly etc.