[Bug]: OH fails to join existing conversations after an unclean exit
[EDIT] Skip this and go directly to https://github.com/All-Hands-AI/OpenHands/issues/6148#issuecomment-2578502601
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Describe the bug and reproduction steps
This happens because join_conversation() does not call self.maybe_start_agent_loop(), which starts the sandbox container, when event_stream is truthy:
event_stream = await self._get_event_stream(sid)
if not event_stream:
    return await self.maybe_start_agent_loop(sid, settings)
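The control flow described above can be modelled with a small toy, which is not the actual OpenHands code: `ToyManager`, `join_as_reported`, `join_always_starting` and the `started` list are all hypothetical names. It only illustrates that when an event stream is found, the early-return branch is skipped and nothing starts the sandbox, and it assumes that starting the agent loop is safe to call more than once.

```python
import asyncio


class ToyManager:
    """Hypothetical stand-in for the conversation manager; not OpenHands code."""

    def __init__(self, known_streams):
        self.known_streams = set(known_streams)  # sids that have an event stream
        self.started = []  # sids whose agent loop (and sandbox) was started

    async def _get_event_stream(self, sid):
        return sid if sid in self.known_streams else None

    async def maybe_start_agent_loop(self, sid):
        # Assumption: starting is idempotent, so a second call is a no-op.
        if sid not in self.started:
            self.started.append(sid)
        return sid

    async def join_as_reported(self, sid):
        event_stream = await self._get_event_stream(sid)
        if not event_stream:
            return await self.maybe_start_agent_loop(sid)
        return event_stream  # the sandbox is never started on this path

    async def join_always_starting(self, sid):
        # One possible change: make sure the loop is running before joining.
        await self.maybe_start_agent_loop(sid)
        return await self._get_event_stream(sid) or sid


manager = ToyManager(known_streams={"abc"})
asyncio.run(manager.join_as_reported("abc"))
started_before = list(manager.started)   # nothing was started
asyncio.run(manager.join_always_starting("abc"))
started_after = list(manager.started)    # the loop for "abc" was started
```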
OpenHands Installation
Development workflow
OpenHands Version
main branch from 2025-01-08
Operating System
WSL on Windows
Logs, Errors, Screenshots, and Additional Context
.
There is another problem:
The port is obtained this way:
def _attach_to_container(self):
    self._container_port = 0
    self.container = self.docker_client.containers.get(self.container_name)
    for port in self.container.attrs['NetworkSettings']['Ports']:  # type: ignore
        self._container_port = int(port.split('/')[0])
        break
But the sandbox containers don't expose ports:
$ docker inspect fa3ff536487c | jq '.[0].NetworkSettings.Ports'
{}
But checking the /alive endpoint returns status: ok.
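With an empty Ports mapping the loop above never runs, so the port stays at 0, which would match the "http://localhost:0" seen in the logs. A minimal sketch of a parser that makes the "no ports" case explicit follows; `first_published_port` is a hypothetical helper, not part of OpenHands or the Docker SDK.

```python
from typing import Mapping, Optional


def first_published_port(ports: Optional[Mapping[str, object]]) -> Optional[int]:
    """Return the first container port in a Docker `Ports` mapping, or None.

    Keys are expected to look like "30000/tcp". An empty or missing mapping
    yields None instead of silently falling back to port 0.
    """
    for key in (ports or {}):
        return int(key.split('/')[0])
    return None
```

A caller could then treat None as "the container exposes nothing" and raise a clear error rather than building a URL with an invalid port.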
Another problem is that exited sandbox containers are not restarted. Running "docker start" on them manually fixes the problem.
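The manual "docker start" step could be automated roughly as follows. This is a sketch under assumptions, not the fix that was merged: `ensure_container_running` is a hypothetical helper, and it expects an object that behaves like a Docker SDK for Python container (`status`, `reload()`, `start()`).

```python
def ensure_container_running(container) -> bool:
    """Start the container if it is stopped; return True if a start was issued.

    `container` is assumed to behave like a docker SDK Container object.
    """
    container.reload()  # refresh cached attributes, including status
    if container.status in ("exited", "created"):
        container.start()
        container.reload()
        return True
    return False
```

It would be called on the object returned by `docker_client.containers.get(name)` before waiting for the /alive endpoint.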
> But the sandbox containers don't expose ports
I confirmed this issue can be fixed with https://github.com/All-Hands-AI/OpenHands/pull/6080
@kripper can you explain how you reproduce this? I tried this:
- Run OpenHands
- Fill out settings
- Prompt "4 + 5"
- Ctrl + C terminal to kill OpenHands
- Run OpenHands
- Press "Jump back to recent conversation"
- Prompt "add 3"
@kripper The stack trace you posted on https://github.com/All-Hands-AI/OpenHands/pull/6114 has a line that seems critical to me:
No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'
Can you confirm that:
- The sessions directory exists in your test environment?
- There are sessions within it?
- They contain files that look something like this:
@tofarr
> Ctrl + C terminal to kill OpenHands
I think this means there was no agent_state.pkl created, because it's saved when the controller closes normally. The /events are saved, of course. I'm not sure when metadata.json is saved.
We recently fixed a case where the pickle doesn't exist (it can be missing due to runtime errors, etc.); we should still be able to restore the history: https://github.com/All-Hands-AI/OpenHands/pull/5946
I think conversation loading should ideally not depend on the existence of this file either?
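A loader that tolerates a missing pickle could look roughly like this. It is a sketch, not the code from the PR above: `load_agent_state` is a hypothetical helper, and only the file name agent_state.pkl is taken from the error message quoted earlier.

```python
import pickle
from pathlib import Path
from typing import Any, Optional


def load_agent_state(session_dir: Path) -> Optional[Any]:
    """Return the unpickled agent state, or None if the file is missing."""
    state_file = session_dir / "agent_state.pkl"
    try:
        with state_file.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # An unclean exit may never have written the file; let the caller
        # fall back to rebuilding the history from the saved events.
        return None
```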
The previous issues were fixed by applying https://github.com/All-Hands-AI/OpenHands/pull/6114.
But one last issue remains, which prevents rejoining conversations after an unclean exit or a reboot of the box (I usually restart the OH container using Docker Desktop).
It's not critical, but it's worth reporting here for the sake of mental health.
I can reproduce this bug consistently.
It happens only the first time I execute make run after a forced (unclean) reboot of the box.
When I execute make run a second time, it works fine (and so on).
I compared the /tmp files and the running processes before and after the first make run, and there was nothing suspicious there.
Maybe OH creates a lock file or similar that must be cleaned up on exit.
If I interrupt make run using CTRL+C (clean exit), reboot, and execute make run again, I can join conversations without problems.
Thus, this issue occurs only after OH was terminated uncleanly.
This is the stack trace generated when trying to join a conversation after the first make run:
21:31:01 - openhands:INFO: docker_runtime.py:147 - [runtime aa4cca868229460fb319c227be9c65db] Waiting for client to become ready at http://localhost:0...
21:31:01 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
Traceback (most recent call last):
File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
await self.runtime.connect()
File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
await call_sync_from_async(self._wait_until_alive)
File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
result = await coro
^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
return copy(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
result = action(retry_state)
^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.INIT to AgentState.FINISHED
21:31:02 - openhands:ERROR: manager.py:209 - Error connecting to conversation aa4cca868229460fb319c227be9c65db: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
Is this still reproducible with 0.21?
No, it's still not working. When restarting OH and trying to rejoin a conversation, it doesn't get the correct port:
requests.exceptions.InvalidURL: Failed to parse: http://localhost:-1/alive
There was a PR that fixes this.
Also, OH is not starting the sandboxes.
There was a PR that fixes this.
See https://github.com/All-Hands-AI/OpenHands/pull/6080
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for over 30 days with no activity.