
Support GAIA benchmark

Jiayi-Pan opened this issue 1 year ago • 13 comments

See Issue #1865. This PR introduces the GAIA benchmark as part of the evaluation harness. Currently, this is a draft version with known limitations and bugs:

  1. File Handling: Some questions come with attached files (e.g., png, MOV, xml, xlsx, txt, json). At present, we simply move them to the workspace and inform the agent that the file is available.
    • To reach a good score on GAIA, the agents need to handle these files properly, and we should consider adding vision support to the agents (see the sketch after this list).
  2. Agent Hang Issue: The agent hangs and doesn’t stop after triggering finish.
    • This issue has been discussed with @xingyaoww, and we believe it might be a bug in the agent implementation or the browser integration.
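
For item 1, a minimal sketch of what the harness could do (the function and file names here are assumptions for illustration, not the actual harness code): copy a question's attachment into the agent workspace and append a note to the task instruction.

import shutil
from pathlib import Path

def attach_file_to_workspace(attachment_path: str, workspace_dir: str, instruction: str) -> str:
    """Copy a GAIA attachment into the workspace and mention it in the task text (illustrative sketch)."""
    src = Path(attachment_path)
    dst = Path(workspace_dir) / src.name
    shutil.copy(src, dst)  # make the file visible inside the agent's workspace
    return instruction + f"\n\nAn attached file is available at /workspace/{src.name}."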

To reproduce error 2, run

python ./evaluation/gaia/run_infer.py \
--level 2023_level1 \
--data-split validation \
--eval-n-limit 1

Jiayi-Pan avatar May 20 '24 05:05 Jiayi-Pan

@Jiayi-Pan

I got the following error when trying to reproduce:

ERROR:root:<class 'datasets.exceptions.DatasetNotFoundError'>: Dataset 'gaia-benchmark/GAIA' doesn't exist on the Hub or cannot be accessed. If the dataset is private or gated, make sure to log in with huggingface-cli login or visit the dataset page at https://huggingface.co/datasets/gaia-benchmark/GAIA to ask for access.

Is it intended to be private for now?

UPDATE: Never mind, I see it's public, and I just need to accept the conditions.
UPDATE 2: I can reproduce the issue!
UPDATE 3: It looks like there's a bug with the browser env. I'll look into it, push a fix to your branch directly to unblock you, and then create a separate PR to fix the problem. ETA: 24 hours (it's midnight PT right now).
UPDATE 4: A fix has been pushed to this branch directly. I also opened https://github.com/OpenDevin/OpenDevin/pull/1933 to fix it on main.
UPDATE 5: Fix pushed to main, and merged back to this branch.
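
For anyone hitting the same gated-dataset error, a minimal loading sketch (the config and split names are taken from the reproduction command above; everything else is an assumption):

# Accept the dataset's terms on its Hugging Face page first, then authenticate,
# e.g. with `huggingface-cli login`. Depending on your datasets version you may
# also need to allow the dataset's loading script.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")
print(len(gaia))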

li-boxuan avatar May 20 '24 06:05 li-boxuan

@Jiayi-Pan Could you please try again? I just pushed a fix to your branch.

li-boxuan avatar May 20 '24 07:05 li-boxuan

@Jiayi-Pan I think for 1, the agent has the ability to open file:// URLs inside the browser, and the browser observation will return a representation (a screenshot as well, though there's no multimodal model yet). Maybe we could still claim that these are possible even for non-text formats such as MOV.

frankxu2004 avatar May 20 '24 11:05 frankxu2004

@frankxu2004 But can the browser (in the app container) actually access files in the sandbox correctly, though?

xingyaoww avatar May 20 '24 11:05 xingyaoww

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov
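
For illustration, a minimal sketch of that idea using only the standard library (the directory and port are assumptions, not what the app actually uses):

# Serve the workspace directory over HTTP so a browser in another container
# could fetch e.g. http://SANDBOX_HOST:8000/file.mov
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

handler = partial(SimpleHTTPRequestHandler, directory="/workspace")
ThreadingHTTPServer(("0.0.0.0", 8000), handler).serve_forever()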

frankxu2004 avatar May 20 '24 11:05 frankxu2004

I am a little confused about the first point. Could someone explain more? Thanks in advance for any explanation.

I found in the codebase that the browser converts the screenshot into a base64-encoded image. Will we pass the base64 image to the LLM, or what will it be used for?

we should consider adding vision support to the agents.

What vision support do you want the agents to have? Do you mean agents being able to access/open other kinds of files (e.g., png, MOV, xml, xlsx, txt, json)? I found #1914, maybe this is what you mean?

That's a very good point!! Maybe we need to expose as a static web server hosting all the files under workspace, so the browser can access through http://SANDBOX_HOST:PORT/workspace/file.mov

I don't understand why a web server would help in this case. Do you mean moving the files onto the web server and letting gymnasium access the file URL just like browsing the web?

yufansong avatar May 20 '24 18:05 yufansong

@yufansong

I found in the codebase that the browser converts the screenshot into a base64-encoded image. Will we pass the base64 image to the LLM, or what will it be used for?

Currently we don't use any multimodal LLMs, so the screenshot is only used for the frontend showcase (the browser tab in the frontend shows what the current browser state looks like).
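
Roughly, the idea is something like the following sketch (illustrative only; the file name is a placeholder): the screenshot bytes are base64-encoded into a data URL that the frontend can render.

import base64

with open("screenshot.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"  # renderable directly in an <img> tag
print(data_url[:60])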

I don't understand why a web server would help in this case. Do you mean moving the files onto the web server and letting gymnasium access the file URL just like browsing the web?

The browser already has screenshot support in our codebase, as well as support for more complex file handling (e.g. the browser can open PDF files, which the command line can't easily do), so allowing the browser to access files inside the sandbox's workspace should enable this scenario. Also, if a multimodal LLM takes in browser screenshot images in the future, this could be a way of unifying visual observation of various files. Still, we could do filetype-specific handling as the other PR you mentioned is doing.

As for the web server: originally I thought the files in the sandbox would not be directly accessible to the browser running on the host. But after checking, it seems /workspace is mounted from the host, so the browser may be able to directly open these files with file://
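
If that's the case, building the file:// URI is trivial (sketch, assuming /workspace is the mount point and the file name is a placeholder):

from pathlib import Path

print(Path("/workspace/file.mov").as_uri())  # file:///workspace/file.mov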

frankxu2004 avatar May 20 '24 22:05 frankxu2004

But after checking, it seems /workspace is mounted from the host, so the browser may be able to directly open these files with file://

That's sweet! Otherwise, hosting a web server just to serve static files under /workspace seems like an anti-pattern to me.

li-boxuan avatar May 21 '24 01:05 li-boxuan

Thanks everyone for the discussion! And thanks to @li-boxuan for fixing the agent hang bug.

After a few more bug fixes, I believe the GAIA evaluation harness is now pretty much complete, although there's still work to be done to develop a high-performing agent on GAIA.

Should we merge this PR now and continue agent development in other threads? For instance, we have an ongoing PR focused on multimodal understanding, #1914.

Jiayi-Pan avatar May 23 '24 04:05 Jiayi-Pan

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multi-modal understanding skills by itself. The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.
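
Roughly what the agent did, as a reconstructed sketch (not the actual transcript; the file name is a placeholder): install python-docx in the sandbox (e.g. `%pip install python-docx` inside Jupyter), then extract the document text.

from docx import Document

doc = Document("/workspace/attachment.docx")
text = "\n".join(p.text for p in doc.paragraphs)  # concatenate all paragraph text
print(text[:500])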

Jiayi-Pan avatar May 23 '24 04:05 Jiayi-Pan

@Jiayi-Pan this is amazing:

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multi-modal understanding skills by itself. The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.

@xingyaoww I'd like to get your review on this if possible, as the evals guru.

neubig avatar May 23 '24 15:05 neubig

Thanks @xingyaoww for the review. I've addressed all the issues.

Jiayi-Pan avatar May 24 '24 02:05 Jiayi-Pan

I tried to run it yesterday. Looks fine.

iFurySt avatar May 24 '24 03:05 iFurySt

I think we can get this merged, and improve it in future PRs:

  • Improve browsing for CodeAct
  • Integrate agentskills to support reading multimodal documents: https://github.com/OpenDevin/OpenDevin/pull/1914, https://github.com/OpenDevin/OpenDevin/pull/2016

I'd appreciate it if anyone could take a quick look at my newest changes. If they look good, we can merge this one.

xingyaoww avatar May 24 '24 11:05 xingyaoww

Hi @xingyaoww, can you post the OpenHands result to the GAIA benchmark? Thanks.

pseudotensor avatar Jan 08 '25 22:01 pseudotensor