Feat/mle bench evaluation
End-user friendly description of the problem this fixes or functionality that this introduces
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.
The MLE-bench implementation provides:
1. A set of scripts to manage test instances, run benchmarks, and score results (see the scoring sketch below).
2. A base Docker image in which agents are expected to run.
3. An agent definition format.
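
For reference, the scoring path leans on the scripts from 1., which OpenAI distributes as the `mlebench` Python package. The snippet below is a minimal sketch of that flow; the module layout, function names (`registry`, `grade_csv`), and the example competition id are assumptions about the upstream package rather than anything added by this PR.

```python
from pathlib import Path

# Assumption: OpenAI's mle-bench is installed and importable as `mlebench`,
# exposing a competition registry and CSV grading helpers. Names and
# signatures may differ between versions; treat this as a sketch.
from mlebench.grade import grade_csv
from mlebench.registry import registry

# Point the registry at competition data prepared with the upstream tooling.
local_registry = registry.set_data_dir(Path("~/.mle-bench/data").expanduser())

# Grade one agent submission for one (example) competition.
competition = local_registry.get_competition("spaceship-titanic")
report = grade_csv(Path("outputs/spaceship-titanic/submission.csv"), competition)
print(report)
```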
The goal of this PR is to re-use as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts from 1. are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around agent definitions so that benchmarking can be driven from this repo.
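
To make the scaffolding concrete: mle-bench expects each agent to be described by a small definition (container image, start command, environment) that its runner can launch. The sketch below is purely illustrative, with hypothetical names; it is not the schema or code introduced by this PR.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the agent-definition scaffolding described above;
# the field names mirror what an mle-bench agent definition typically carries
# (image, start command, env vars) but are not the PR's actual schema.
@dataclass
class AgentDefinition:
    name: str
    image: str                      # Docker image built on the MLE-bench base image
    start_command: str              # entrypoint executed inside the container
    env: dict[str, str] = field(default_factory=dict)


openhands_agent = AgentDefinition(
    name="openhands",
    image="openhands-mle-bench:latest",      # assumed tag, built on the base image
    start_command="python /agent/start.py",  # assumed entrypoint
    env={"LLM_MODEL": "gpt-4o"},             # model passed through to the agent
)
```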
Link of any specific issues this addresses
#4328