feat: started working on SWE-bench evals
Implemented with gptme, using moatless-tools and aider as reference implementations.
- [x] Set up harness
- [ ] Get a single eval instance passing
  - [ ] Gets stuck at installing deps for repos
    - moatless-tools doesn't support running tests?
    - aider depends on the Docker env?
- [ ] Try making our own eval instance?
> [!IMPORTANT]
> Introduces SWE-bench evaluation framework in `gptme`, with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

- New Features:
  - Introduces SWE-bench evaluation framework in `gptme/eval/swebench`.
  - Implements `run_swebench_evaluation()` in `evaluate.py` to evaluate instances using an `Agent`.
  - Adds CLI command in `main.py` for running evaluations with options for model, dataset, split, instance, and verbosity.
- Utilities:
  - `utils.py` provides functions for loading instances, setting up repositories, and extracting file spans from patches.
- Configuration:
  - Adds `gptme-eval-swebench` script entry in `pyproject.toml`.
  - Adds `datasets` and `fsspec` as dependencies in `pyproject.toml`.

This description was created automatically for 4e9b48a07eaa8ba54c35e3a23156aac2ae4656aa. It will automatically update as commits are pushed.
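For context on what those pieces do, here is a rough, standalone sketch of the instance-loading, repository-setup, and patch-span steps. It only relies on the `datasets` dependency added in this PR; the function names and the SWE-bench field names (`instance_id`, `repo`, `base_commit`, `patch`) are assumptions for illustration, not the actual `gptme/eval/swebench` API.

```python
# Hypothetical sketch, not the actual gptme/eval/swebench code.
# Field names are assumed from the public SWE-bench datasets.
import re
import subprocess
from pathlib import Path

from datasets import load_dataset


def load_instances(dataset: str = "princeton-nlp/SWE-bench_Lite", split: str = "test") -> list[dict]:
    """Load SWE-bench instances as a list of dicts."""
    return list(load_dataset(dataset, split=split))


def setup_repository(instance: dict, workdir: Path) -> Path:
    """Clone the instance's repo and check out its base commit."""
    repo_dir = workdir / instance["instance_id"]
    if not repo_dir.exists():
        subprocess.run(
            ["git", "clone", f"https://github.com/{instance['repo']}", str(repo_dir)],
            check=True,
        )
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=repo_dir, check=True)
    return repo_dir


def extract_file_spans(patch: str) -> dict[str, list[tuple[int, int]]]:
    """Map each file touched by a unified diff to its (start_line, length) hunk spans."""
    spans: dict[str, list[tuple[int, int]]] = {}
    current = None
    for line in patch.splitlines():
        if line.startswith("+++ b/"):
            current = line[len("+++ b/"):]
            spans.setdefault(current, [])
        elif line.startswith("@@") and current:
            m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@", line)
            if m:
                spans[current].append((int(m.group(1)), int(m.group(2) or 1)))
    return spans


if __name__ == "__main__":
    instances = load_instances()
    first = instances[0]
    print(first["instance_id"], list(extract_file_spans(first["patch"])))
```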
Codecov Report
Attention: Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.
Project coverage is 77.15%. Comparing base (81708e6) to head (4e9b48a).
:white_check_mark: All tests successful. No failed tests found.
Additional details and impacted files
| Coverage Diff | master | #142 | +/- |
|---|---|---|---|
| Coverage | 80.63% | 77.15% | -3.49% |
| Files | 52 | 57 | +5 |
| Lines | 3145 | 3287 | +142 |
| Hits | 2536 | 2536 | |
| Misses | 609 | 751 | +142 |
| Flag | Coverage Δ | |
|---|---|---|
| anthropic/claude-3-haiku-20240307 | 76.11% <0.00%> (-3.44%) | :arrow_down: |
| openai/gpt-4o-mini | 75.84% <0.00%> (-3.43%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
Anthropic announced that Claude 3.5 (new), aka Claude "3.6", scores 49% on SWE-Bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet
I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.
Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.
I got it kinda working with swe-agent and this dataset, which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra (loading sketch below)
Might also integrate https://swe-rex.com/latest/, which seems pretty useful.
My branch is a giant mess atm though 😭
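For anyone who wants to point the harness at SWE-bench-extra in the meantime, loading it should just be a matter of swapping the dataset name passed to `datasets`; the `train` split name here is an assumption, so check the dataset card:

```python
from datasets import load_dataset

# Assumed split name; verify on the Hub dataset card for nebius/SWE-bench-extra.
extra = load_dataset("nebius/SWE-bench-extra", split="train")
print(len(extra), extra[0]["instance_id"])
```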
@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes :pray:
I also just found SWE-Gym: https://arxiv.org/abs/2412.21139
I'll try to get back on this soon! At the very least I'll just share a gist which shows how to get the deps to install properly etc.