OpenHands
Add Visual SWE-bench benchmark
End-user friendly description of the problem this fixes or functionality that this introduces.
Give a summary of what the PR does, explaining any non-trivial design decisions.
Visual SWE-bench is a benchmark of visual issues. Its structure is similar to SWE-bench, but each problem statement also includes visual data. This PR lets OpenHands use the benchmark's evaluation Docker image for both inference and evaluation. It builds on #5911.
Link of any specific issues this addresses.