
[Evaluation] Fix SWE-Bench Evaluation on Devin's Output

Open xingyaoww opened this issue 11 months ago • 2 comments

Following the instructions here, set up the prediction files from Devin and run the evaluation using OpenDevin's SWE-Bench fork.

This task aims to ensure that the SWE-Bench evaluation (using OpenDevin's fork) runs successfully on all of Devin's prediction files. Instead of sending a PR to this repo, you should fix the issues and send PRs to our SWE-Bench fork.

I have attached a log file showing multiple issues that came up when running SWE-Bench on Devin's output. Search for 'Traceback' to find the exact error messages.

swe-bench-devin.log
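
To get a quick overview of the failures rather than searching by hand, a small script along these lines could count the final exception line of each traceback. This is only a sketch: `summarize_tracebacks` is a hypothetical helper, and it assumes the log contains standard, unprefixed Python tracebacks (no timestamp or logger prefix on each line), which may not hold for the attached file.

```python
from collections import Counter


def summarize_tracebacks(log_text: str) -> Counter:
    """Count the final exception line of each Python traceback in log_text.

    Assumption: tracebacks appear in the standard format, starting with
    'Traceback (most recent call last):' and ending with the first
    non-indented, non-empty line (the exception type and message).
    """
    counts = Counter()
    in_traceback = False
    for line in log_text.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            in_traceback = True
            continue
        # Inside a traceback, frame lines are indented; the first
        # non-indented line is the exception summary.
        if in_traceback and line and not line[0].isspace():
            counts[line.strip()] += 1
            in_traceback = False
    return counts


def summarize_log_file(path: str) -> Counter:
    """Read a log file from disk and summarize its tracebacks."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return summarize_tracebacks(f.read())
```

Calling `summarize_log_file("swe-bench-devin.log").most_common()` would then list the distinct errors by frequency, which may help decide which repository to debug first.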

A suggested way to get started: You may try to create one prediction JSON (see more about the prediction file format here) from each SWE-Bench repo (e.g., you will have data/predictions/sklearn.json, data/predictions/matplotlibs.json, etc). Then, you may try to run evaluations on them to debug repositories one-by-one until the issue is fixed.
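
The per-repo split suggested above could be sketched roughly as follows. This assumes each prediction is a JSON object with an `instance_id` field of the form `<owner>__<repo>-<number>` (for example `scikit-learn__scikit-learn-12345`); check the linked prediction file format before relying on it. The helper names `repo_of`, `split_by_repo`, and `write_groups` are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def repo_of(instance_id: str) -> str:
    """Strip the trailing '-<number>' from an instance id.

    Assumption: ids look like 'scikit-learn__scikit-learn-12345'.
    """
    return instance_id.rsplit("-", 1)[0]


def split_by_repo(predictions: list[dict]) -> dict[str, list[dict]]:
    """Group prediction records by the repo encoded in their instance_id."""
    groups = defaultdict(list)
    for pred in predictions:
        groups[repo_of(pred["instance_id"])].append(pred)
    return dict(groups)


def write_groups(groups: dict[str, list[dict]], out_dir: str) -> None:
    """Write one JSON file per repo into out_dir (created if missing)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for repo, preds in groups.items():
        (out / f"{repo}.json").write_text(json.dumps(preds, indent=2))
```

With this sketch the files are named after the id prefix (e.g. `data/predictions/scikit-learn__scikit-learn.json`) rather than short names like `sklearn.json`; rename them as needed. Each file can then be evaluated on its own to isolate repo-specific failures.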

xingyaoww avatar Mar 22 '24 10:03 xingyaoww

@xingyaoww Picking this up

guneetsk99 avatar Mar 22 '24 11:03 guneetsk99

I think @libowen2121 is taking a look at this, thanks!

neubig avatar Apr 14 '24 02:04 neubig