
Fix issue #5222: [Refactor]: Refactor the evaluation directory

openhands-agent opened this pull request 1 year ago • 26 comments

This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.

Changes

  • Created the evaluation/benchmarks/ directory to house all ML literature benchmarks (see the quick layout check after this list)
  • Kept utility directories (utils, integration_tests, regression, static) directly under evaluation/
  • Updated paths in documentation and GitHub workflows to reflect the new structure
  • Added missing benchmarks to evaluation/README.md:
    • Commit0 and DiscoveryBench under Software Engineering
    • Browsing Delegation under Web Browsing
    • ScienceAgentBench under Misc. Assistance
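
A quick way to sanity-check the new layout locally is simply to list the directories. This is only an illustrative sketch; the benchmark names shown are examples mentioned in this thread, not an exhaustive list:

ls -F evaluation/
# expected: README.md  benchmarks/  integration_tests/  regression/  static/  utils/
ls -F evaluation/benchmarks/
# expected entries include swe_bench/, webarena/, gaia/, mint/, ...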

Testing

  • All pre-commit hooks pass (ruff, mypy, etc.)
  • All unit tests pass (377 tests)

Review Notes

Key files to review:

  • .github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
  • evaluation/README.md - Added missing benchmarks and updated paths
  • Documentation files - Updated references to benchmark paths

To run this PR locally, use the following command:

docker run -it --rm \
    -p 3000:3000 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --add-host host.docker.internal:host-gateway \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b28439d-nikolaik \
    --name openhands-app-b28439d \
    docker.all-hands.dev/all-hands-ai/openhands:b28439d

openhands-agent avatar Nov 23 '24 13:11 openhands-agent

Just noting that I have reviewed the code and it looks good to me, but I'd like a second review.

neubig avatar Nov 23 '24 13:11 neubig

I'm trying to run 1 instance of swe-bench on this PR, and I get this error:

....venv/bin/python: can't open file '/Users/enyst/repos/odie/evaluation/swe_bench/run_infer.py': [Errno 2] No such file or directory

The shell scripts (every run_infer.sh in the benchmark directories) need to be updated to run the right .py file:

e.g., ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh contains the line

COMMAND="poetry run python evaluation/swe_bench/run_infer.py \

enyst avatar Nov 23 '24 18:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 18:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 19:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 19:11 github-actions[bot]

@openhands-agent The previous comment was fixed for swe-bench. For every other benchmark that was moved from the ./evaluation directory to ./evaluation/benchmarks/, we need to verify whether any scripts hard-coded the old structure when calling some other script. For example, in ./evaluation/benchmarks/webarena, we can search for .sh scripts, and we will find:

./evaluation/benchmarks/webarena/scripts/run_infer.sh

Read the script. We will find a line like:

COMMAND="poetry run python evaluation/webarena/run_infer.py \

This will fail because the new location is in the evaluation/benchmarks/webarena/ directory.

You are a smart LLM; you understand patterns. The same pattern is repeated across these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify them accordingly.
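
One way to find every script that still hard-codes the old layout, and to fix them in one pass, is roughly the following. This is only a sketch run from the repository root, assuming GNU grep/sed/xargs, lowercase/underscore benchmark directory names, and covering only the run_infer.py references from the example; other stale paths would need similar treatment:

# list .sh files that still reference the old evaluation/<benchmark>/run_infer.py path
grep -rn --include="*.sh" "evaluation/[a-z_]*/run_infer\.py" evaluation/benchmarks/
# rewrite those references to the new evaluation/benchmarks/<benchmark>/ location
grep -rl --include="*.sh" "evaluation/[a-z_]*/run_infer\.py" evaluation/benchmarks/ \
  | xargs -r sed -i 's#evaluation/\([a-z_]*\)/run_infer\.py#evaluation/benchmarks/\1/run_infer.py#g'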

enyst avatar Nov 23 '24 19:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 19:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 19:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 19:11 github-actions[bot]

The previous comments were fixed for shell scripts.

After the refactoring of benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory, there should be a README.md file. For example, in ./evaluation/benchmarks/swe_bench, there is a README.md with instructions on how to set up and run the swe-bench benchmark.

You can read it, and see it has, for example, a line like this:

./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh.

You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. Verify every .md file (not only README.md; there may be others) in each benchmark and update it for this particular move. Keep it minimal; only solve this particular issue.
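
For instance, after this pass the documented swe-bench command from above would read (only the path prefix changes; the bracketed arguments stay the same):

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]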

enyst avatar Nov 23 '24 20:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 20:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 20:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 20:11 github-actions[bot]

Not all the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks.

Some were, some were not. So we need to fix the rest.

But we have a problem. You're good, but you were not allowed to use the "str_replace_editor" tool when "old_str" is not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context. Then they were performed, but you ran out of time.

You need to understand this very well, so that we do better this time. Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now.

Usually, there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench", to be updated to the new location). It is possible there were two occurrences or more, where one is the syntax of the command to run the benchmark and another is a particular example of running the command.

First, think how to do this better this time. You have two options:

  • you can take more context every time you want to use the str_replace_editor tool. Not a lot more, just more than the patterned string itself.
  • use the bash tool instead, with a bash command to perform the replacements in each .md file all at once. In this case, you need to find them first, to know what you need to do.

Make a decision, then perform it. Do not ask me about it.
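
A minimal sketch of the second option, assuming GNU grep/sed/xargs (on macOS, sed -i needs an empty '' argument) and run from the repository root:

# for each benchmark directory, rewrite old-style evaluation/<name>/ references
# in the .md files to the new evaluation/benchmarks/<name>/ location
for bench in $(ls evaluation/benchmarks/); do
  grep -rl --include="*.md" "evaluation/${bench}/" evaluation/benchmarks/ \
    | xargs -r sed -i "s#evaluation/${bench}/#evaluation/benchmarks/${bench}/#g"
done

Already-updated paths are left alone, since the pattern only matches when the benchmark name follows evaluation/ directly.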

enyst avatar Nov 23 '24 21:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 21:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 21:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 21:11 github-actions[bot]

You're good! Your choice to run bash scripts was brilliant!

You fixed the rest of the documentation for human users in only 4 steps this time.

Now. You did very well, and I think all we have left is to double check that there are no leftover old paths. If there are, we need to fix them.

Leftovers could be in:

  • .py files. This is by far the most important! The module names can break everything. You did so well with the bash tool previously that I suggest you try it again to verify all imports in .py files, or any other occurrences of the patterns, in the benchmark files (see the sketch after this list).
  • .sh scripts. Verify them and fix them if necessary.
  • You can skip .md files this time; you did a very good verification last time.
  • Other files.
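
For the .py check in the first bullet, something along these lines would surface leftovers (a sketch; the benchmark names in the patterns are just the ones mentioned in this thread, and dotted import paths of the form evaluation.<benchmark> are an assumption):

# stale dotted module paths, e.g. imports like evaluation.swe_bench.run_infer
grep -rn --include="*.py" -E "evaluation\.(swe_bench|webarena|gaia|mint)" evaluation/
# stale filesystem-style paths in .py and .sh files
grep -rn --include="*.py" --include="*.sh" -E "evaluation/(swe_bench|webarena|gaia|mint)/" evaluation/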

Remember to first look at all benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you are working with.

Check in order and update as needed.

FINALLY: When you finish with all of them, I want you to make an extra check:

  • in the root directory this time, there is a file called CREDITS.md. Read it. It probably doesn't have paths to update, that's okay, but it does have a list of benchmarks and some other details about them. Update the file according to what you have learned about our current benchmarks.

enyst avatar Nov 23 '24 21:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 21:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 21:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 21:11 github-actions[bot]

@openhands-agent-exp Your bash skills may be impressive, but you got lazy last time and you forgot under what constraints you were supposed to be working! I had to revert your last attempt.

Be careful. We have only a couple things left to fix here:

  1. In our new ./evaluation/benchmarks directory, find and read carefully:
  • from Gaia, run_infer.sh
  • from mint, README.md
  • from swe-bench, eval_infer.sh
  You know how to find the exact files. Note that they may be in a subdirectory. Read those files and, if they need fixes, fix them. I prefer you just take it easy with the str_replace_editor tool this time. Just be precise! They're probably half-updated, so make sure you don't overwrite what is already up to date.
  2. In the root directory, there is a file called CREDITS.md. Read it. It probably doesn't have paths to update, that's okay, but it does have a list of benchmarks and some other details about them. Update the file according to what you have learned about our current benchmarks. Do not duplicate references. Make sure you add missing ones.

Remember to list all benchmarks first; it helps you know what to work with.
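
A quick way to pull up those three files and check them for leftover old-style paths (a sketch; the lowercase directory names gaia, mint and swe_bench are assumptions):

find evaluation/benchmarks/gaia evaluation/benchmarks/mint evaluation/benchmarks/swe_bench \
  \( -name run_infer.sh -o -name README.md -o -name eval_infer.sh \) \
  -exec grep -nH -E "evaluation/(gaia|mint|swe_bench)/" {} +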

enyst avatar Nov 23 '24 22:11 enyst

OpenHands started fixing the pr! You can monitor the progress here.

github-actions[bot] avatar Nov 23 '24 22:11 github-actions[bot]

New OpenHands update

openhands-agent avatar Nov 23 '24 22:11 openhands-agent

The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

github-actions[bot] avatar Nov 23 '24 22:11 github-actions[bot]

Hey @enyst, thanks so much for cleaning this up!

neubig avatar Nov 25 '24 13:11 neubig