
Support GAIA benchmark

Jiayi-Pan opened this issue 1 year ago • 13 comments

See Issue #1865. This PR introduces the GAIA benchmark as part of the evaluation harness. Currently, this is a draft version with known limitations and bugs:

  1. File Handling: Some questions come with attached files (e.g., png, MOV, xml, xlsx, txt, json). At present, we simply move them to the workspace and inform the agent that the file is available.
    • To reach a good score on GAIA, the agents need to handle these files properly, and we should consider adding vision support to the agents (see the sketch after this list).
  2. Agent Hang Issue: The agent hangs and doesn’t stop after triggering finish.
    • This issue has been discussed with @xingyaoww, and we believe it might be a bug in the agent implementation or the browser integration.
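
For item 1, a minimal sketch of what the harness could do (the function and file names here are assumptions for illustration, not the actual harness code): copy a question's attachment into the agent workspace and append a note to the task instruction.

import shutil
from pathlib import Path

def attach_file_to_workspace(attachment_path: str, workspace_dir: str, instruction: str) -> str:
    """Copy a GAIA attachment into the workspace and mention it in the task text (illustrative sketch)."""
    src = Path(attachment_path)
    dst = Path(workspace_dir) / src.name
    shutil.copy(src, dst)  # make the file visible inside the agent's workspace
    return instruction + f"\n\nAn attached file is available at /workspace/{src.name}."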

To reproduce error 2, run

python ./evaluation/gaia/run_infer.py \
--level 2023_level1 \
--data-split validation \
--eval-n-limit 1

Jiayi-Pan avatar May 20 '24 05:05 Jiayi-Pan

@Jiayi-Pan

I got the following error when trying to reproduce:

ERROR:root:<class 'datasets.exceptions.DatasetNotFoundError'>: Dataset 'gaia-benchmark/GAIA' doesn't exist on the Hub or cannot be accessed. If the dataset is private or gated, make sure to log in with huggingface-cli login or visit the dataset page at https://huggingface.co/datasets/gaia-benchmark/GAIA to ask for access.

Is it intended to be private for now?

UPDATE: Never mind, I see it's public, and I just need to accept the conditions.
UPDATE 2: I can reproduce the issue!
UPDATE 3: It looks like there's a bug with the browser env. I'll look into it, push a fix to your branch directly to unblock you, and then create a separate PR to fix the problem. ETA: 24 hours (it's midnight PT right now).
UPDATE 4: A fix has been pushed to this branch directly. I also opened https://github.com/OpenDevin/OpenDevin/pull/1933 to fix it on main.
UPDATE 5: Fix pushed to main, and merged back to this branch.
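
For anyone hitting the same gated-dataset error, a minimal loading sketch (the config and split names are taken from the reproduction command above; everything else is an assumption):

# Accept the dataset's terms on its Hugging Face page first, then authenticate,
# e.g. with `huggingface-cli login`. Depending on your datasets version you may
# also need to allow the dataset's loading script.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")
print(len(gaia))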

li-boxuan avatar May 20 '24 06:05 li-boxuan

@Jiayi-Pan Could you please try again? I just pushed a fix to your branch.

li-boxuan avatar May 20 '24 07:05 li-boxuan

@Jiayi-Pan I think for 1, the agent has the ability to open file:// URLs inside the browser, and the browser observation will return a representation (a screenshot as well, though there's no multimodal model yet). Maybe we could still claim that these are possible even for non-text formats such as MOV.

frankxu2004 avatar May 20 '24 11:05 frankxu2004

@frankxu2004 But can the browser (in the app container) actually access files in the sandbox correctly, though?

xingyaoww avatar May 20 '24 11:05 xingyaoww

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov
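
For illustration, a minimal sketch of that idea using only the standard library (the directory and port are assumptions, not what the app actually uses):

# Serve the workspace directory over HTTP so a browser in another container
# could fetch e.g. http://SANDBOX_HOST:8000/file.mov
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

handler = partial(SimpleHTTPRequestHandler, directory="/workspace")
ThreadingHTTPServer(("0.0.0.0", 8000), handler).serve_forever()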

frankxu2004 avatar May 20 '24 11:05 frankxu2004

I am a little confused about the first point. Could someone explain more? Thanks in advance for any explanation.

I found in the codebase that the browser converts the screenshot into a base64-encoded image. Will we pass the base64 image to the LLM, or what will it be used for?

we should consider adding vision support to the agents.

What vision support do you want the agents to have? Do you mean agents being able to access/open other kinds of files (e.g., png, MOV, xml, xlsx, txt, json)? I found #1914, maybe this is what you mean?

That's a very good point!! Maybe we need to expose as a static web server hosting all the files under workspace, so the browser can access through http://SANDBOX_HOST:PORT/workspace/file.mov

I don't understand why a web server would help in this case. Do you mean moving the files onto the web server and letting gymnasium access the file URL just like browsing the web?

yufansong avatar May 20 '24 18:05 yufansong

@yufansong

I found in the codebase that the browser converts the screenshot into a base64-encoded image. Will we pass the base64 image to the LLM, or what will it be used for?

Currently we don't use any multimodal LLMs, so the screenshot is only used for the frontend showcase (the browser tab in the frontend shows what the current browser state looks like).
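
Roughly, the idea is something like the following sketch (illustrative only; the file name is a placeholder): the screenshot bytes are base64-encoded into a data URL that the frontend can render.

import base64

with open("screenshot.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"  # renderable directly in an <img> tag
print(data_url[:60])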

I don't understand why a web server would help in this case. Do you mean moving the files onto the web server and letting gymnasium access the file URL just like browsing the web?

The browser already has screenshot support in our codebase, as well as support for more complex file handling (e.g. the browser can open PDF files, which the command line can't easily do), so allowing the browser to access files inside the sandbox's workspace should enable this scenario. Also, if a multimodal LLM takes in browser screenshot images in the future, this could be a way of unifying visual observation of various files. Still, we could do filetype-specific handling as the other PR you mentioned is doing.

As for the web server: originally I thought the files in the sandbox would not be directly accessible to the browser running on the host. But after checking, it seems /workspace is mounted from the host, so the browser may be able to directly open these files with file://
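
If that's the case, building the file:// URI is trivial (sketch, assuming /workspace is the mount point and the file name is a placeholder):

from pathlib import Path

print(Path("/workspace/file.mov").as_uri())  # file:///workspace/file.mov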

frankxu2004 avatar May 20 '24 22:05 frankxu2004

But after checking, it seems /workspace is mounted from the host, so the browser may be able to directly open these files with file://

That's sweet! Otherwise, hosting a web server just to serve static files under /workspace seems like an anti-pattern to me.

li-boxuan avatar May 21 '24 01:05 li-boxuan

Thanks everyone for the discussion! And thanks to @li-boxuan for fixing the agent hang bug.

After a few more bug fixes, I believe the GAIA evaluation harness is now pretty much complete, although there's still work to be done to develop a high-performing agent on GAIA.

Should we merge this PR now and continue agent development in other threads? For instance, we have an ongoing PR focused on multimodal understanding, #1914.

Jiayi-Pan avatar May 23 '24 04:05 Jiayi-Pan

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multi-modal understanding skills by itself. The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.
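
Roughly what the agent did, as a reconstructed sketch (not the actual transcript; the file name is a placeholder): install python-docx in the sandbox (e.g. `%pip install python-docx` inside Jupyter), then extract the document text.

from docx import Document

doc = Document("/workspace/attachment.docx")
text = "\n".join(p.text for p in doc.paragraphs)  # concatenate all paragraph text
print(text[:500])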

Jiayi-Pan avatar May 23 '24 04:05 Jiayi-Pan

@Jiayi-Pan this is amazing:

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multi-modal understanding skills by itself. The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.

@xingyaoww I'd like to get your review on this if possible, as the evals guru.

neubig avatar May 23 '24 15:05 neubig

Thanks @xingyaoww for the review. I've addressed all the issues.

Jiayi-Pan avatar May 24 '24 02:05 Jiayi-Pan

I tried to run it yesterday. Looks fine.

iFurySt avatar May 24 '24 03:05 iFurySt

I think we can get this merged, and improve it in future PRs:

  • Improve browsing for CodeAct
  • Integrate agentskills to support reading multimodal documents: https://github.com/OpenDevin/OpenDevin/pull/1914, https://github.com/OpenDevin/OpenDevin/pull/2016

I'd appreciate it if anyone could take a quick look at my newest changes. If they look good, we can merge this one.

xingyaoww avatar May 24 '24 11:05 xingyaoww

Hi @xingyaoww, can you post the OpenHands result to the GAIA benchmark? Thanks.

pseudotensor avatar Jan 08 '25 22:01 pseudotensor