feat: started working on SWE-bench evals

Open ErikBjare opened this issue 1 year ago • 6 comments

Implemented with gptme, given moatless-tools and aider as reference implementations.

  • [x] Set up harness
  • [ ] Get a single eval instance passing
    • [ ] Gets stuck at installing deps for repos
      • moatless-tools doesn't support running tests?
      • aider depends on the Docker env?
  • [ ] Try making our own eval instance?

[!IMPORTANT] Introduces SWE-bench evaluation framework in gptme with new modules for instance loading, repository setup, and evaluation execution, along with CLI support and updated dependencies.

  • New Features:
    • Introduces SWE-bench evaluation framework in gptme/eval/swebench.
    • Implements run_swebench_evaluation() in evaluate.py to evaluate instances using an Agent.
    • Adds CLI command in main.py for running evaluations with options for model, dataset, split, instance, and verbosity.
  • Utilities:
    • utils.py provides functions for loading instances, setting up repositories, and extracting file spans from patches.
  • Configuration:
    • Adds gptme-eval-swebench script entry in pyproject.toml.
    • Adds datasets and fsspec as dependencies in pyproject.toml.
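
As a rough illustration of the "extracting file spans from patches" utility mentioned above, here is a minimal sketch. It is not the code in gptme/eval/swebench/utils.py; the function name `extract_file_spans`, its return shape, and the sample patch are all assumptions made for this example. It also assumes git-style patches, where each file section starts with a `diff --git` header.

```python
import re
from collections import defaultdict

# Header that git writes before each file section, e.g. "diff --git a/x.py b/x.py".
FILE_RE = re.compile(r"^diff --git a/(.+?) b/(.+)$")
# Unified-diff hunk header, e.g. "@@ -10,3 +10,4 @@"; the length defaults to 1 if omitted.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def extract_file_spans(patch):
    """Hypothetical helper: map each patched file to (start, end) line spans
    in the pre-patch version of that file."""
    spans = defaultdict(list)
    current_file = None
    for line in patch.splitlines():
        file_match = FILE_RE.match(line)
        if file_match:
            current_file = file_match.group(1)
            continue
        hunk_match = HUNK_RE.match(line)
        if current_file and hunk_match:
            start = int(hunk_match.group(1))
            length = int(hunk_match.group(2)) if hunk_match.group(2) is not None else 1
            # A pure insertion has length 0; report it as a one-line span.
            spans[current_file].append((start, start + max(length, 1) - 1))
    return dict(spans)


# Made-up patch used only to exercise the sketch.
SAMPLE_PATCH = """\
diff --git a/pkg/core.py b/pkg/core.py
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -10,3 +10,4 @@ def f():
     a = 1
-    b = 2
+    b = 3
+    c = 4
     return a
@@ -40 +41,2 @@
-x = 1
+x = 2
+y = 3
"""

print(extract_file_spans(SAMPLE_PATCH))
```

Spans like these could be compared against the files and lines an agent actually edited, which is one plausible use in an evaluation harness.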

This description was created by Ellipsis for 4e9b48a07eaa8ba54c35e3a23156aac2ae4656aa. It will automatically update as commits are pushed.

ErikBjare, Sep 30 '24 07:09

Codecov Report

Attention: Patch coverage is 0% with 142 lines in your changes missing coverage. Please review.

Project coverage is 77.15%. Comparing base (81708e6) to head (4e9b48a).

:white_check_mark: All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gptme/eval/swebench/evaluate.py | 0.00% | 62 Missing :warning: |
| gptme/eval/swebench/utils.py | 0.00% | 47 Missing :warning: |
| gptme/eval/swebench/main.py | 0.00% | 29 Missing :warning: |
| gptme/eval/swebench/__init__.py | 0.00% | 3 Missing :warning: |
| gptme/eval/swebench/__main__.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
- Coverage   80.63%   77.15%   -3.49%     
==========================================
  Files          52       57       +5     
  Lines        3145     3287     +142     
==========================================
  Hits         2536     2536              
- Misses        609      751     +142     

| Flag | Coverage Δ |
|---|---|
| anthropic/claude-3-haiku-20240307 | 76.11% <0.00%> (-3.44%) :arrow_down: |
| openai/gpt-4o-mini | 75.84% <0.00%> (-3.43%) :arrow_down: |

Flags with carried forward coverage won't be shown.

codecov-commenter, Sep 30 '24 07:09

Anthropic announced that Claude 3.5 Sonnet (new), aka Claude "3.6", scores 49% on SWE-bench Verified with a simple harness: https://www.anthropic.com/research/swe-bench-sonnet

I think optimizing for the particular benchmark might become less and less necessary over time, unless you want to squeeze performance out of smaller models.

Would be cool to make a proper run and get listed on the SWE-Bench leaderboard, though.

ErikBjare, Nov 01 '24 21:11

I got it kinda working with swe-agent and this dataset, which contains many more issues: https://huggingface.co/datasets/nebius/SWE-bench-extra

Might also integrate https://swe-rex.com/latest/ which seems pretty useful

My branch is a giant mess atm though 😭

bjsi, Dec 27 '24 09:12

@bjsi Would be very interested to get it working if you can find the time to extract the relevant changes :pray:

ErikBjare, Jan 14 '25 15:01

I also just found SWE-Gym: https://arxiv.org/abs/2412.21139

ErikBjare, Jan 14 '25 16:01

I'll try to get back to this soon! At the very least I'll share a gist that shows how to get the deps to install properly etc.

bjsi, Jan 14 '25 16:01