eval: add Visual SWE-bench benchmark
End-user friendly description of the problem this fixes or functionality that this introduces
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Give a summary of what the PR does, explaining any non-trivial design decisions

Visual SWE-bench focuses on visual issues and follows a structure similar to SWE-bench, where each problem statement includes visual data. This PR enables OpenHands to use the evaluation Docker image for inference and evaluation on this benchmark.
Link of any specific issues this addresses
Hi @xingyaoww, just a gentle reminder about this PR. Please let me know if you need any clarification or updates from my side. Thank you!
Hi @xingyaoww, we have now removed the unused scripts as you suggested. Please let me know if there’s anything else that needs improvement or clarification. Thank you!
@luolin101 are you mostly waiting for Xingyao's review here?
@mamoodi Yes, I'm waiting for his review to merge the PR.
@luolin101 Thank you for this! I'm a bit unsure about the /examples directory with outputs. We don't usually store evaluation outputs in the repository, they're very large. For example, the ./evaluation_outputs/outputs directory in swe-bench is always empty in the repository.
Do you think they are needed here, or could we leave them out?
@enyst Thank you for your feedback. I think the /examples directory can be omitted. Our code is quite similar to swe-bench, and after running run_infer.sh, we obtain an output.jsonl, which is stored by default in the evaluation/evaluation_outputs directory. Running eval_infer.sh will allow us to evaluate the results.
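For readers following along, a minimal sketch of that workflow is below. The script paths, the `visual_swe_bench` directory name, the `llm.eval` model config, the `CodeActAgent` agent name, and the `<run-dir>` placeholder are all assumptions modeled on the existing swe-bench scripts, not the confirmed interface of this PR:

```bash
# Sketch only: paths and arguments are assumptions based on the swe-bench scripts.

# Inference: writes output.jsonl under evaluation/evaluation_outputs by default.
./evaluation/visual_swe_bench/scripts/run_infer.sh llm.eval CodeActAgent

# Evaluation: scores the generated patches using the benchmark's evaluation Docker image.
./evaluation/visual_swe_bench/scripts/eval_infer.sh \
    evaluation/evaluation_outputs/<run-dir>/output.jsonl
```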
@enyst @xingyaoww Is there anything else I can clarify? I'd be glad to assist in moving this PR forward when possible.
@luolin101 sorry for the long wait! I've tried running the inference and patch eval scripts and they seem to work fine! Could you please merge the latest changes and resolve the conflicts, and then @xingyaoww can take a final look.