OpenHands icon indicating copy to clipboard operation
OpenHands copied to clipboard

Support GAIA benchmark

Open Jiayi-Pan opened this issue 1 year ago • 2 comments

What problem or use case are you trying to solve? Just discussed with @xingyaoww over Slack and we'd like to enable benchmarking OpenDevin on GAIA. Compared to coding-centric benchmarks like SWE-bench, GAIA can provide a more comprehensive view on agent's ability for general assistance tasks.

Describe the UX of the solution you'd like Benchmarking a model(agent)'s GAIA score through a few simple commands

Do you have thoughts on the technical implementation?

  • Add image input support for LLMs
    • litellm already supports image input and we will need to add corresponding functionality on Open-Devin side
  • Evaluation utilities

Jiayi-Pan avatar May 17 '24 21:05 Jiayi-Pan

What baseline agent do we want to test? I wonder what level of browsing capability is required

frankxu2004 avatar May 17 '24 23:05 frankxu2004

I think it's mostly information seeking. Besides, the benchmark covers a wide range of difficulties and other scenarios. So we don't need to worry about getting 0 score lol

Jiayi-Pan avatar May 18 '24 02:05 Jiayi-Pan