Feat/mle bench evaluation
End-user friendly description of the problem this fixes or functionality that this introduces
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.
The MLE-bench implementation provides:
1. A set of scripts to manage test instances, run benchmarks, and score results (see the scoring sketch below).
2. A base Docker image in which agents are expected to run.
3. An agent definition format.
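
For reference, the scoring path leans on the scripts from 1., which OpenAI distributes as the `mlebench` Python package. The snippet below is a minimal sketch of that flow; the module layout, function names (`registry`, `grade_csv`), and the example competition id are assumptions about the upstream package rather than anything added by this PR.

```python
from pathlib import Path

# Assumption: OpenAI's mle-bench is installed and importable as `mlebench`,
# exposing a competition registry and CSV grading helpers. Names and
# signatures may differ between versions; treat this as a sketch.
from mlebench.grade import grade_csv
from mlebench.registry import registry

# Point the registry at competition data prepared with the upstream tooling.
local_registry = registry.set_data_dir(Path("~/.mle-bench/data").expanduser())

# Grade one agent submission for one (example) competition.
competition = local_registry.get_competition("spaceship-titanic")
report = grade_csv(Path("outputs/spaceship-titanic/submission.csv"), competition)
print(report)
```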
The goal of this PR is to re-use as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts from 1. are exposed as a Python package, so we assume the tester has OpenAI's implementation installed elsewhere to manage the base image and test instances, and we re-implement some minor scaffolding around agent definitions so that benchmarking can be driven from this repo.
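
To make the scaffolding concrete: mle-bench expects each agent to be described by a small definition (container image, start command, environment) that its runner can launch. The sketch below is purely illustrative, with hypothetical names; it is not the schema or code introduced by this PR.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the agent-definition scaffolding described above;
# the field names mirror what an mle-bench agent definition typically carries
# (image, start command, env vars) but are not the PR's actual schema.
@dataclass
class AgentDefinition:
    name: str
    image: str                      # Docker image built on the MLE-bench base image
    start_command: str              # entrypoint executed inside the container
    env: dict[str, str] = field(default_factory=dict)


openhands_agent = AgentDefinition(
    name="openhands",
    image="openhands-mle-bench:latest",      # assumed tag, built on the base image
    start_command="python /agent/start.py",  # assumed entrypoint
    env={"LLM_MODEL": "gpt-4o"},             # model passed through to the agent
)
```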
Link of any specific issues this addresses
#4328