Refactor of error handling
End-user friendly description of the problem this fixes or functionality that this introduces
- [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

CHANGELOG: Overhaul of error handling and reporting
This PR puts forward a new philosophy for error handling in OpenHands:
- Errors caused by the agent being stupid are ErrorObservations
  - the agent can recover from these automatically, given feedback
- Errors caused by fatal environment problems are Exceptions
  - these stop the agent loop
  - the user must take some action to continue
  - some plumbing is involved to get them back to the user, but runtime and agent_controller are both working nicely
Both types show up as messages in the chat window (no more toasts!). Both types of errors MAY include an error_id, which triggers a nice, translated message in the FE, and hides the (potentially ugly) details behind a button.
If no error_id is included, we fall back to just showing the details, which may be garbled nonsense like a stack trace or JSON from an API.
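To make the distinction concrete, here's a rough sketch of the two categories (hypothetical names like `AgentRuntimeError` and `handle_bad_edit` are illustrative only, not the actual OpenHands implementation):

```python
# Sketch of the two error categories described above (names are illustrative).

class ErrorObservation:
    """Recoverable: goes into the event stream so the agent can react to it."""
    def __init__(self, content: str, error_id: str | None = None):
        self.content = content
        # Optional error_id lets the frontend show a translated message
        # and hide the (potentially ugly) details behind a button.
        self.error_id = error_id

class AgentRuntimeError(Exception):
    """Fatal: raised instead of emitted, which stops the agent loop."""

def handle_bad_edit(event_stream):
    # Agent mistake: recoverable, so the agent sees it as feedback.
    event_stream.add_event(ErrorObservation('Bad edit: file does not exist'))

def handle_runtime_down():
    # Environment failure: fatal, so we raise and let the loop stop.
    raise AgentRuntimeError('Runtime disconnected; user action required')
```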
Give a summary of what the PR does, explaining any non-trivial design decisions
Right now, we have three types of errors:
- ErrorObservation: tends to be triggered by the agent doing something stupid, which it can recover from, like sending a non-existent action or making a bad edit
  - the agent can recover from these, so dropping them into the event stream makes sense
- FatalErrorObservation: tends to be triggered by critical errors, like the runtime going down unexpectedly
  - the agent cannot recover from these, but they need to be reported to the user
- normal Python exceptions: crop up randomly, and are sometimes converted into one of the above
  - these are unexpected
This PR does away with FatalErrorObservation, and instead always raises exceptions for errors that are unrelated to the agent doing its business.
ErrorObservations will continue to flow into the event stream, but we will never stop the loop. They will show up in the message window.
Exceptions will not flow into the event stream. Instead, they will stop the loop, and be sent back to the user, where they show up as a message in the chat window.
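To illustrate the flow (a rough sketch, not the literal agent_controller code; `report_to_user` stands in for the plumbing that carries the error back to the chat window):

```python
# Rough sketch of the loop behavior described above (hypothetical names).
async def run_loop(controller, report_to_user):
    try:
        while not controller.finished:
            # ErrorObservations produced inside step() just flow into the
            # event stream; the loop keeps going and the agent self-corrects.
            await controller.step()
    except Exception as e:
        # Any exception is treated as fatal: stop the loop and surface the
        # error as a chat message so the user can act on it.
        await report_to_user(e)
```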
There are two major error flows that I've tested:
- Wrong API key (currently the agent just hangs on the frontend; with this PR, you get a nice message)
- Runtime disconnected (currently we do a decent job here, but this makes it more robust)
I also removed a lot of the retry logic in EventStreamRuntime, which is a bit of an anti-pattern, and was hiding a lot of pain (e.g. something that obviously wasn't going to work would stall the system for five minutes because we retried on every Exception).
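For context, here's a sketch of the alternative to that anti-pattern (illustrative only, using tenacity; `execute_action` and the endpoint shape are made up, not code in this PR): retry only on errors that are plausibly transient, and fail fast on everything else.

```python
import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Retry only plausibly-transient failures (connection drops, timeouts)
# instead of retrying on every Exception, which hides real bugs.
@retry(
    retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
)
def execute_action(url: str, payload: dict) -> dict:
    resp = requests.post(url, json=payload, timeout=30)
    resp.raise_for_status()  # a 4xx here fails fast instead of retrying
    return resp.json()
```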
Link of any specific issues this addresses
To run this PR locally, use the following command:
```bash
docker run -it --rm -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:6ce2e6c-nikolaik \
  --name openhands-app-6ce2e6c \
  ghcr.io/all-hands-ai/runtime:6ce2e6c
```
PR Reviewer Comments
After analyzing the pull request, here are my findings:
Overall Feedback:
The minor refactor of the agent controller is a positive step towards improving the codebase's maintainability and clarity. The changes made to handle errors more effectively and the removal of FatalErrorObservation in favor of a more general ErrorObservation are commendable. However, there are opportunities to enhance the code further, particularly in terms of consistency and error handling.
Score: 87/100
Labels: Bug fix, Enhancement
Code Suggestions:
- File: `openhands/runtime/utils/bash.py`
  - Suggestion: Ensure that the `run` method consistently returns the same type of observation to avoid type-related issues.
  - Relevant line: `+ def run(self, action: CmdRunAction) -> CmdOutputObservation | ErrorObservation:`
  - Existing code: `def run(self, action: CmdRunAction) -> CmdOutputObservation | FatalErrorObservation:`
  - Improved code: `def run(self, action: CmdRunAction) -> CmdOutputObservation | ErrorObservation:`
- File: `openhands/runtime/utils/edit.py`
  - Suggestion: Consider using a consistent naming convention for error handling to improve readability and maintainability.
  - Relevant line: `+ return ErrorObservation(...)`
  - Existing code: `return ErrorObservation(f'Fatal Runtime in editing: Expected FileWriteObservation, got {type(obs)}: {str(obs)}',)`
  - Improved code: `return ErrorObservation(f'Error in editing: Expected FileWriteObservation, got {type(obs)}: {str(obs)}',)`
- File: `tests/unit/test_agent_controller.py`
  - Suggestion: Ensure that the test names clearly reflect the functionality being tested for better clarity.
  - Relevant line: `+async def test__react_to_error(mock_agent, mock_event_stream):`
  - Existing code: `async def test_report_error(mock_agent, mock_event_stream):`
  - Improved code: `async def test_react_to_error(mock_agent, mock_event_stream):`
Some notes:
- first run, I'm seeing a huge amount of "Getting container logs.."; might be from the logging PR or some combo with the timeouts changes, etc, here
- running with `main.py`: the first agent response is a MessageAction, not displayed in console
  - oh, the user enters something ("print 55")
  - then the previous agent message is displayed in console ("please let me know what you'd like me to do...")
  - then the Observation
  - then the Action that triggered the Obs 😅
I have no idea yet if it's in main, and the "environment" PR is also a possible suspect, but this is odd enough that I'll just write it down. @rbren
fixed the Getting container logs.. spam! looking into the message thing...
OK the message thing was really weird! First heisenbug I've seen in a long time--adding prints fixed it!
Turned out it was because of directly calling `async_add_event` instead of letting the event stream run the handlers in a different async loop. Fixed!
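For posterity, here's a distilled model of the bug (toy names, not the real EventStream class): handlers are supposed to run on the stream's own loop, and awaiting them inline couples them to the caller's loop and timing.

```python
import asyncio
import threading

class EventStreamSketch:
    """Toy model: handlers run on the stream's dedicated loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()
        self._handlers = []

    def add_event(self, event):
        # Correct path: hand each handler off to the stream's own loop.
        for handler in self._handlers:
            asyncio.run_coroutine_threadsafe(handler(event), self._loop)

    async def async_add_event(self, event):
        # Bug pattern: awaiting handlers inline runs them on the caller's
        # loop, so their side effects interleave with the caller's code in
        # a timing-dependent order (hence the heisenbug).
        for handler in self._handlers:
            await handler(event)
```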
The unit tests job got stuck twice, and it looks like the 3rd time it isn't better. Maybe it's not a fluke? It gets stuck in the controller tests.
Whoops, lingering calls to `async_add_event`
@enyst any particular workflows you see breaking?
Oh, I see the eval tasks really don't work at the moment. From the existing integration tests, I also don't see any that would fail, because they're happy paths, and we need things going wrong.
What seems likely to fail:
- evals error report
- evals retrying on some errors, like swe-bench linked above
- restoring session with `main.py` after stuck
- maybe restoring session with UI after stuck
- when the runtime has a serious issue/exception, with `main.py`, it blocks all.
@xingyaoww "fatal" now means "shut everything down, the environment is in an irrecoverable state and the user needs to take action"
If e.g. an action times out, that's not fatal--the agent can try a different action, or increase the timeout (IIRC that's an option for the agent)
OK I think I've addressed your comments @xingyaoww and @enyst! One last question for you though, Xingyao
Thanks for the feedback--definitely helpful
Quick notes from a number of runs on this branch, trying to get it "stuck in a loop", but I'm not lucky tonight:
- a few runs fail (deepseek, gemini), but checked and it's not the branch; okay we really should test in general/release with function calling and without function calling, plus prompt caching, b/c a few LLMs fail again on `main` too
- running with 4o-mini:
  - fails with `UnicodeDecodeError` and it gets completely stuck looping forever; I have to kill it
Log
```
Request user input >> make a bash script to solve the 32 queens problem. MUST BE BASH SCRIPT!
18:40:16 - openhands:DEBUG: stream.py:168 - Adding NullObservation id=49 from AGENT
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-141' coro=<run_controller.<locals>.on_event() done, defined at /Users/enyst/repos/odie/openhands/core/main.py:175> exception=UnicodeDecodeError('utf-8', b'make a bash script to solve the 32 queens problem. MUST BE BASH SCRIPT!\xc2', 71, 72, 'unexpected end of data')>
Traceback (most recent call last):
  File "/Users/enyst/repos/odie/openhands/core/main.py", line 181, in on_event
    message = input('Request user input >> ')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 71: unexpected end of data
```
- docker fails, apparently when there is already a container with the same hash; it looks like we don't clean up old containers at the end of the run, but at the start of the next one
  - on main, this exception was caught, logged, the old container was cleaned, and the run continued
  - on branch, the exception was not caught, it was logged, the old container was cleaned, but it ends `main.py`; so next restart it works (in other words, it takes another restart before it works)
  - if so, IMHO that's fine for now... for a human's use
  - afaik there is now a retry functionality on evals that we may want to test with, though, because those scripts also use `run_controller` from `main.py`
Log
File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 209, in connect await call_sync_from_async(self._init_container) File "/Users/enyst/repos/odie/openhands/utils/async_utils.py", line 18, in call_sync_from_async result = await coro ^^^^^^^^^^ File "/Users/enyst/.pyenv/versions/3.12.6/lib/python3.12/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/openhands/utils/async_utils.py", line 17, incoro = loop.run_in_executor(None, lambda: fn(*args, **kwargs)) ^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 344, in _init_container raise e File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 314, in _init_container self.container = self.docker_client.containers.run( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/models/containers.py", line 876, in run container = self.create(image=image, command=command, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/models/containers.py", line 935, in create resp = self.client.api.create_container(**create_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/container.py", line 440, in create_container return self.create_container_from_config(config, name, platform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/container.py", line 457, in create_container_from_config return self._result(res, True) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/client.py", line 281, in _result self._raise_for_status(response) File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/client.py", line 277, in _raise_for_status raise create_api_error_from_http_exception(e) from e ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception raise cls(e, response=response, explanation=explanation) from e ERROR:root: : 409 Client Error for http+docker://localhost/v1.47/containers/create?name=openhands-runtime-deepseek-test3-1c9aa5c1ab5ff20d: Conflict ("Conflict. The container name "/openhands-runtime-deepseek-test3-1c9aa5c1ab5ff20d" is already in use by container "bd195e1b6e311c3b12d874f30e74c9a6ad3a4765078b182d30b4ef508724ceee". You have to remove (or rename) that container to be able to reuse that name.") (.venv) ➜ odie git:(rb/prevent-event-loop) ✗
OK I actually saw the docker cleanup issue as well--will work on that.
Re: UnicodeDecodeError, I pushed back on making that a hard error, but maybe I'm wrong? Where did that weird byte come from?
@enyst any other blockers from you?
OK added a more graceful handling of an existing container!
@enyst looking at your unicode issue it seems like you might have pasted something funky into the prompt...
Re: unicode. I don't know, I just yelled at the LLM, hmm... OK, let's ignore until it happens again. 😅
I didn't get to see the implementation of the new loop, and I didn't get it stuck.
No blockers though.
@rbren I migrated all the core eval logic into a single test case. We can start it by doing:
```bash
export ALLHANDS_API_KEY="ah-YOUR-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
pytest -vvxss tests/runtime/test_stress_remote_runtime.py
```
I changed the number of workers to 1 in this PR, yet got this issue:
Thanks Xingyao, testing on my end!
@enyst you're still blocking merge on this one--if you're good w/ it, mind approving?
Sorry @enyst my bad, looks like you're good! It's Xingyao 🙃
Yes, in my opinion we should take this in as soon as we can. It's been getting difficult/slow to solve some types of issues on main without side effects, because the arch is broken and things are cropping up in the strangest places. We can work easier on this. ❤️
My last concern is just evaluation: e.g., we removed a bunch of the 503/404 waits, which will break eval :( -- but if we hope to get this in soon for the release, I can definitely unblock this.
Stress tests are passing now! Thanks for writing that Xingyao, it did surface a bug where /alive returns 503 for a second even after the pod is ready (going to guess due to Traefik/ingress not being ready)
I think we're good to go! Enabling auto-merge...