Refactor of error handling
End-user friendly description of the problem this fixes or functionality that this introduces
- [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

CHANGELOG: Overhaul of error handling and reporting
This PR puts forward a new philosophy for error handling in OpenHands:
- Errors caused by the agent being stupid are ErrorObservations
  - the agent can recover from these automatically, given feedback
- Errors caused by fatal environment problems are Exceptions
  - these stop the agent loop
  - the user must take some action to continue
  - some plumbing is involved to get them back to the user, but runtime and agent_controller are both working nicely
Both types show up as messages in the chat window (no more toasts!). Both types of errors MAY include an error_id, which triggers a nice, translated message in the FE, and hides the (potentially ugly) details behind a button.
If no error_id is included, we fall back to just showing the details, which may be garbled nonsense like a stack trace or JSON from an API.
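To make the distinction concrete, here's a rough sketch of the two categories (hypothetical names like `AgentRuntimeError` and `handle_bad_edit` are illustrative only, not the actual OpenHands implementation):

```python
# Sketch of the two error categories described above (names are illustrative).

class ErrorObservation:
    """Recoverable: goes into the event stream so the agent can react to it."""
    def __init__(self, content: str, error_id: str | None = None):
        self.content = content
        # Optional error_id lets the frontend show a translated message
        # and hide the (potentially ugly) details behind a button.
        self.error_id = error_id

class AgentRuntimeError(Exception):
    """Fatal: raised instead of emitted, which stops the agent loop."""

def handle_bad_edit(event_stream):
    # Agent mistake: recoverable, so the agent sees it as feedback.
    event_stream.add_event(ErrorObservation('Bad edit: file does not exist'))

def handle_runtime_down():
    # Environment failure: fatal, so we raise and let the loop stop.
    raise AgentRuntimeError('Runtime disconnected; user action required')
```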
Give a summary of what the PR does, explaining any non-trivial design decisions
Right now, we have three types of errors:
- ErrorObservation: tends to be triggered by the agent doing something stupid, which it can recover from, like sending a non-existent action or making a bad edit
  - the agent can recover from these, so dropping them into the event stream makes sense
- FatalErrorObservation: tends to be triggered by critical errors, like the runtime going down unexpectedly
  - the agent cannot recover from these, but they need to be reported to the user
- normal Python exceptions: crop up randomly, and are sometimes converted into one of the above
  - these are unexpected
This PR does away with FatalErrorObservation, and instead always raises exceptions for errors that are unrelated to the agent doing its business.
ErrorObservations will continue to flow into the event stream, but we will never stop the loop. They will show up in the message window.
Exceptions will not flow into the event stream. Instead, they will stop the loop, and be sent back to the user, where they show up as a message in the chat window.
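To illustrate the flow (a rough sketch, not the literal agent_controller code; `report_to_user` stands in for the plumbing that carries the error back to the chat window):

```python
# Rough sketch of the loop behavior described above (hypothetical names).
async def run_loop(controller, report_to_user):
    try:
        while not controller.finished:
            # ErrorObservations produced inside step() just flow into the
            # event stream; the loop keeps going and the agent self-corrects.
            await controller.step()
    except Exception as e:
        # Any exception is treated as fatal: stop the loop and surface the
        # error as a chat message so the user can act on it.
        await report_to_user(e)
```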
There are two major error flows that I've tested:
- Wrong API key (currently the agent just hangs on the frontend; with this PR, you get a nice message)
- Runtime disconnected (currently we do a decent job here, but this makes it more robust)
I also removed a lot of the retry logic in EventStreamRuntime, which is a bit of an anti-pattern, and was hiding a lot of pain (e.g. something that obviously wasn't going to work would stall the system for five minutes because we retried on every Exception).
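For context, here's a sketch of the alternative to that anti-pattern (illustrative only, using tenacity; `execute_action` and the endpoint shape are made up, not code in this PR): retry only on errors that are plausibly transient, and fail fast on everything else.

```python
import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Retry only plausibly-transient failures (connection drops, timeouts)
# instead of retrying on every Exception, which hides real bugs.
@retry(
    retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
)
def execute_action(url: str, payload: dict) -> dict:
    resp = requests.post(url, json=payload, timeout=30)
    resp.raise_for_status()  # a 4xx here fails fast instead of retrying
    return resp.json()
```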
Link of any specific issues this addresses
To run this PR locally, use the following command:
```bash
docker run -it --rm -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:6ce2e6c-nikolaik \
  --name openhands-app-6ce2e6c \
  ghcr.io/all-hands-ai/runtime:6ce2e6c
```
PR Reviewer Comments
After analyzing the pull request, here are my findings:
Overall Feedback:
The minor refactor of the agent controller is a positive step towards improving the codebase's maintainability and clarity. The changes made to handle errors more effectively and the removal of FatalErrorObservation in favor of a more general ErrorObservation are commendable. However, there are opportunities to enhance the code further, particularly in terms of consistency and error handling.
Score: 87/100
Labels: Bug fix, Enhancement
Code Suggestions:
- File: `openhands/runtime/utils/bash.py`
  - Suggestion: Ensure that the `run` method consistently returns the same type of observation to avoid type-related issues.
  - Relevant line: `+ def run(self, action: CmdRunAction) -> CmdOutputObservation | ErrorObservation:`
  - Existing code: `def run(self, action: CmdRunAction) -> CmdOutputObservation | FatalErrorObservation:`
  - Improved code: `def run(self, action: CmdRunAction) -> CmdOutputObservation | ErrorObservation:`
- File: `openhands/runtime/utils/edit.py`
  - Suggestion: Consider using a consistent naming convention for error handling to improve readability and maintainability.
  - Relevant line: `+ return ErrorObservation(...)`
  - Existing code: `return ErrorObservation(f'Fatal Runtime in editing: Expected FileWriteObservation, got {type(obs)}: {str(obs)}',)`
  - Improved code: `return ErrorObservation(f'Error in editing: Expected FileWriteObservation, got {type(obs)}: {str(obs)}',)`
- File: `tests/unit/test_agent_controller.py`
  - Suggestion: Ensure that the test names clearly reflect the functionality being tested for better clarity.
  - Relevant line: `+async def test__react_to_error(mock_agent, mock_event_stream):`
  - Existing code: `async def test_report_error(mock_agent, mock_event_stream):`
  - Improved code: `async def test_react_to_error(mock_agent, mock_event_stream):`
Some notes:
- first run, I'm seeing a huge amount of "Getting container logs.."; might be from the logging PR or some combo with the timeouts changes, etc, here
- running with `main.py`: the first agent response is a MessageAction, not displayed in console
  - oh, the user enters something ("print 55")
  - then the previous agent message is displayed in console ("please let me know what you'd like me to do...")
  - then the Observation
  - then the Action that triggered the Obs 😅
I have no idea yet if it's in main, and the "environment" PR is also a possible suspect, but this is odd enough that I'll just write it down. @rbren
fixed the Getting container logs.. spam! looking into the message thing...
OK the message thing was really weird! First heisenbug I've seen in a long time--adding prints fixed it!
Turned out it was because of directly calling `async_add_event` instead of letting the event stream run the handlers in a different async loop. Fixed!
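For posterity, here's a distilled model of the bug (toy names, not the real EventStream class): handlers are supposed to run on the stream's own loop, and awaiting them inline couples them to the caller's loop and timing.

```python
import asyncio
import threading

class EventStreamSketch:
    """Toy model: handlers run on the stream's dedicated loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()
        self._handlers = []

    def add_event(self, event):
        # Correct path: hand each handler off to the stream's own loop.
        for handler in self._handlers:
            asyncio.run_coroutine_threadsafe(handler(event), self._loop)

    async def async_add_event(self, event):
        # Bug pattern: awaiting handlers inline runs them on the caller's
        # loop, so their side effects interleave with the caller's code in
        # a timing-dependent order (hence the heisenbug).
        for handler in self._handlers:
            await handler(event)
```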
The unit tests job got stuck twice, and it looks like the 3rd time it isn't better. Maybe it's not a fluke? It gets stuck in the controller tests.
Whoops, lingering calls to `async_add_event`
@enyst any particular workflows you see breaking?
Oh, I see the eval tasks really don't work at the moment. From the existing integration tests, I also don't see any that would fail, because they're happy paths, and we need things going wrong.
What seems likely to fail:
- evals error report
- evals retrying on some errors, like swe-bench linked above
- restoring session with `main.py` after stuck
- maybe restoring session with UI after stuck
- when the runtime has a serious issue/exception, with `main.py`, it blocks all.
@xingyaoww "fatal" now means "shut everything down, the environment is in an irrecoverable state and the user needs to take action"
If e.g. an action times out, that's not fatal--the agent can try a different action, or increase the timeout (IIRC that's an option for the agent)
OK I think I've addressed your comments @xingyaoww and @enyst! One last question for you though, Xingyao
Thanks for the feedback--definitely helpful
Quick notes from a number of runs on this branch, trying to get it "stuck in a loop", but I'm not lucky tonight:
- a few runs fail (deepseek, gemini), but checked and it's not the branch; okay we really should test in general/release with function calling and without function calling, plus prompt caching, b/c a few LLMs fail again on `main` too
- running with 4o-mini:
  - fails with `UnicodeDecodeError` and it gets completely stuck looping forever; I have to kill it
Log
```
Request user input >> make a bash script to solve the 32 queens problem. MUST BE BASH SCRIPT!
18:40:16 - openhands:DEBUG: stream.py:168 - Adding NullObservation id=49 from AGENT
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-141' coro=<run_controller.<locals>.on_event() done, defined at /Users/enyst/repos/odie/openhands/core/main.py:175> exception=UnicodeDecodeError('utf-8', b'make a bash script to solve the 32 queens problem. MUST BE BASH SCRIPT!\xc2', 71, 72, 'unexpected end of data')>
Traceback (most recent call last):
  File "/Users/enyst/repos/odie/openhands/core/main.py", line 181, in on_event
    message = input('Request user input >> ')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 71: unexpected end of data
```
- docker fails, apparently when there is already a container with the same hash; it looks like we don't clean up old containers at the end of the run, but at the start of the next one
  - on main, this exception was caught, logged, the old container was cleaned, and the run continued
  - on branch, the exception was not caught, it was logged, the old container was cleaned, but it ends `main.py`; so next restart it works (in other words, it takes another restart before it works)
  - if so, IMHO that's fine for now... for a human's use
  - afaik there is now a retry functionality on evals that we may want to test with, though, because those scripts also use `run_controller` from `main.py`
Log
File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 209, in connect await call_sync_from_async(self._init_container) File "/Users/enyst/repos/odie/openhands/utils/async_utils.py", line 18, in call_sync_from_async result = await coro ^^^^^^^^^^ File "/Users/enyst/.pyenv/versions/3.12.6/lib/python3.12/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/openhands/utils/async_utils.py", line 17, incoro = loop.run_in_executor(None, lambda: fn(*args, **kwargs)) ^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 344, in _init_container raise e File "/Users/enyst/repos/odie/openhands/runtime/impl/eventstream/eventstream_runtime.py", line 314, in _init_container self.container = self.docker_client.containers.run( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/models/containers.py", line 876, in run container = self.create(image=image, command=command, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/models/containers.py", line 935, in create resp = self.client.api.create_container(**create_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/container.py", line 440, in create_container return self.create_container_from_config(config, name, platform) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/container.py", line 457, in create_container_from_config return self._result(res, True) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/client.py", line 281, in _result self._raise_for_status(response) File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/api/client.py", line 277, in _raise_for_status raise create_api_error_from_http_exception(e) from e ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/enyst/repos/odie/.venv/lib/python3.12/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception raise cls(e, response=response, explanation=explanation) from e ERROR:root: : 409 Client Error for http+docker://localhost/v1.47/containers/create?name=openhands-runtime-deepseek-test3-1c9aa5c1ab5ff20d: Conflict ("Conflict. The container name "/openhands-runtime-deepseek-test3-1c9aa5c1ab5ff20d" is already in use by container "bd195e1b6e311c3b12d874f30e74c9a6ad3a4765078b182d30b4ef508724ceee". You have to remove (or rename) that container to be able to reuse that name.") (.venv) ➜ odie git:(rb/prevent-event-loop) ✗
OK I actually saw the docker cleanup issue as well--will work on that.
Re: UnicodeDecodeError, I pushed back on making that a hard error, but maybe I'm wrong? Where did that weird byte come from?
@enyst any other blockers from you?
OK added a more graceful handling of an existing container!
@enyst looking at your unicode issue it seems like you might have pasted something funky into the prompt...
Re: unicode. I don't know, I just yelled at the LLM, hmm... OK, let's ignore until it happens again. 😅
I didn't get to see the implementation of the new loop, and I didn't get it stuck.
No blockers though.
@rbren I migrated all the core eval logic into a single test case. We can start it by doing:
```bash
export ALLHANDS_API_KEY="ah-YOUR-KEY"
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
pytest -vvxss tests/runtime/test_stress_remote_runtime.py
```
I changed the number of workers to 1 in this PR, yet got this issue:
Thanks Xingyao, testing on my end!
@enyst you're still blocking merge on this one--if you're good w/ it, mind approving?
Sorry @enyst my bad, looks like you're good! It's Xingyao 🙃
Yes, in my opinion we should take this in as soon as we can. It's been getting difficult/slow to solve some types of issues on main without side effects, because the arch is broken and things are cropping up in the strangest places. We can work easier on this. ❤️
My last concern is just evaluation: e.g., we removed a bunch of the 503/404 waits, which will break eval :( -- but if we hope to get this in soon for the release, I can definitely unblock this.
Stress tests are passing now! Thanks for writing that Xingyao, it did surface a bug where /alive returns 503 for a second even after the pod is ready (going to guess due to Traefik/ingress not being ready)
I think we're good to go! Enabling auto-merge...