OpenHands
[Evaluation] Fix SWE-Bench Evaluation on Devin's Output
Following the instructions here, you will set up prediction files from Devin and run the evaluation using OpenDevin's SWE-Bench fork.
This task aims to ensure that the SWE-Bench evaluation (using OpenDevin's fork) can run successfully on all of Devin's prediction files. Instead of sending PRs to this repo, you should fix the issues and send PRs to our SWE-Bench fork.
I have attached a log file showing multiple issues encountered while running SWE-Bench on Devin's output -- search for 'Traceback' to find the exact error messages.
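For example, assuming the attached log is saved locally as `devin_eval.log` (an illustrative filename), the error locations can be listed with:

```shell
# Print every traceback header in the evaluation log, with line numbers
# ("devin_eval.log" is a placeholder for wherever you saved the attached log)
grep -n "Traceback" devin_eval.log
```

Each matching line marks the start of a Python traceback; the lines that follow it in the log identify which repository's evaluation failed and why.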
A suggested way to get started: you may try to create one prediction JSON file (see more about the prediction file format here) for each SWE-Bench repo (e.g., you will have data/predictions/sklearn.json, data/predictions/matplotlib.json, etc.). Then you may run the evaluation on each file to debug the repositories one by one until the issues are fixed.
@xingyaoww Picking this up
I think @libowen2121 is taking a look at this, thanks!