OpenHands
[Evaluation] Fix SWE-Bench Evaluation on Devin's Output
Following the instructions here, you will set up prediction files from Devin and run the evaluation using OpenDevin's SWE-Bench fork.
This task aims to ensure that the SWE-Bench evaluation (using OpenDevin's fork) can run successfully on all of Devin's prediction files. Instead of sending PRs to this repo, you should fix the issues and send PRs to our SWE-Bench fork.
I have attached a log file showing multiple issues encountered while running SWE-Bench on Devin's output -- search for 'Traceback' to find the exact error messages.
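For example, assuming the attached log is saved locally as `devin_eval.log` (an illustrative filename), the error locations can be listed with:

```shell
# Print every traceback header in the evaluation log, with line numbers
# ("devin_eval.log" is a placeholder for wherever you saved the attached log)
grep -n "Traceback" devin_eval.log
```

Each matching line marks the start of a Python traceback; the lines that follow it in the log identify which repository's evaluation failed and why.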
A suggested way to get started: you may try to create one prediction JSON file (see more about the prediction file format here) for each SWE-Bench repo (e.g., you will have data/predictions/sklearn.json, data/predictions/matplotlib.json, etc.). Then you may run the evaluation on each file to debug the repositories one by one until the issues are fixed.
@xingyaoww Picking this up
I think @libowen2121 is taking a look at this, thanks!