eval: add Visual SWE-bench benchmark
End-user friendly description of the problem this fixes or functionality that this introduces
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Give a summary of what the PR does, explaining any non-trivial design decisions

Visual SWE-bench focuses on visual issues and follows a structure similar to SWE-bench, where each problem statement includes visual data. This PR enables OpenHands to use the evaluation Docker image for inference and evaluation on this benchmark.
Link of any specific issues this addresses
Hi @xingyaoww, just a gentle reminder about this PR. Please let me know if you need any clarification or updates from my side. Thank you!
Hi @xingyaoww, we have now removed the unused scripts as you suggested. Please let me know if there’s anything else that needs improvement or clarification. Thank you!
@luolin101 are you mostly waiting for Xingyao's review here?
@mamoodi Yes, I'm waiting for his review to merge the PR.
@luolin101 Thank you for this! I'm a bit unsure about the /examples directory with outputs. We don't usually store evaluation outputs in the repository, they're very large. For example, the ./evaluation_outputs/outputs directory in swe-bench is always empty in the repository.
Do you think they are needed here, or could we leave them out?
@enyst Thank you for your feedback. I think the /examples directory can be omitted. Our code is quite similar to swe-bench, and after running run_infer.sh, we obtain an output.jsonl, which is stored by default in the evaluation/evaluation_outputs directory. Running eval_infer.sh will allow us to evaluate the results.
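For readers following along, a minimal sketch of that workflow is below. The script paths, the `visual_swe_bench` directory name, the `llm.eval` model config, the `CodeActAgent` agent name, and the `<run-dir>` placeholder are all assumptions modeled on the existing swe-bench scripts, not the confirmed interface of this PR:

```bash
# Sketch only: paths and arguments are assumptions based on the swe-bench scripts.

# Inference: writes output.jsonl under evaluation/evaluation_outputs by default.
./evaluation/visual_swe_bench/scripts/run_infer.sh llm.eval CodeActAgent

# Evaluation: scores the generated patches using the benchmark's evaluation Docker image.
./evaluation/visual_swe_bench/scripts/eval_infer.sh \
    evaluation/evaluation_outputs/<run-dir>/output.jsonl
```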
@enyst @xingyaoww Is there anything else I can clarify? I'd be glad to assist in moving this PR forward when possible.
@luolin101 sorry for the long wait! I've tried running the inference and patch eval scripts and they seem to work fine! Could you please merge the latest changes and resolve the conflicts, and then @xingyaoww can take a final look.