
Feat/mle bench evaluation

Open · csmith49 opened this issue 1 year ago · 1 comment

End-user friendly description of the problem this fixes or functionality that this introduces

  • [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds support for testing OpenHands agents on MLE-bench using the standard OpenHands evaluation harness.

The MLE-bench implementation provides:

  1. A set of scripts to manage test instances, run benchmarks, and score results.
  2. A base Docker image in which agents should be run.
  3. An agent definition format.

This PR aims to reuse as much existing infrastructure as possible by providing a suitable OpenHands agent definition. However, only the scripts in item 1 are exposed as a Python package. We therefore assume the tester has OpenAI's MLE-bench implementation installed separately to manage the base image and test instances. We also re-implement some minor scaffolding around agent definitions so that benchmarks can be run from this repo.


Link of any specific issues this addresses

#4328

csmith49 · Nov 20 '24 17:11