[Bug]: OH fails to join existing conversations after an unclean exit

Open kripper opened this issue 1 year ago • 7 comments

[EDIT] Skip this and go directly to https://github.com/All-Hands-AI/OpenHands/issues/6148#issuecomment-2578502601

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Describe the bug and reproduction steps

This happens because join_conversation() does not call self.maybe_start_agent_loop (which starts the sandbox container) when event_stream is truthy, i.e. when the conversation already has an event stream:

event_stream = await self._get_event_stream(sid)
if not event_stream:
    return await self.maybe_start_agent_loop(sid, settings)
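A toy model of that branch may make the failure mode more concrete. The class and its stubbed methods are hypothetical stand-ins, not the OpenHands implementation; only the names _get_event_stream and maybe_start_agent_loop come from the snippet above.

```python
import asyncio


class ToyManager:
    """Hypothetical stand-in for the conversation manager (not OpenHands code)."""

    def __init__(self, stored_sids):
        self.stored_sids = set(stored_sids)  # conversations with a persisted event stream
        self.started = []                    # sids whose agent loop (and sandbox) was started

    async def _get_event_stream(self, sid):
        # Returns a truthy object when events for `sid` already exist.
        return object() if sid in self.stored_sids else None

    async def maybe_start_agent_loop(self, sid, settings):
        self.started.append(sid)
        return f"loop:{sid}"

    async def join_conversation(self, sid, settings=None):
        event_stream = await self._get_event_stream(sid)
        if not event_stream:
            return await self.maybe_start_agent_loop(sid, settings)
        # Existing stream: nothing on this path starts the sandbox.
        return event_stream


manager = ToyManager(stored_sids={"old"})
asyncio.run(manager.join_conversation("old"))
asyncio.run(manager.join_conversation("new"))
print(manager.started)  # ['new']
```

Only the brand-new conversation reaches maybe_start_agent_loop; the existing one returns early, which matches the behavior described above.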

OpenHands Installation

Development workflow

OpenHands Version

main branch from 2025-01-08

Operating System

WSL on Windows

Logs, Errors, Screenshots, and Additional Context

.

kripper avatar Jan 08 '25 16:01 kripper

There is another problem:

The port is obtained this way:

def _attach_to_container(self):
    self._container_port = 0
    self.container = self.docker_client.containers.get(self.container_name)
    for port in self.container.attrs['NetworkSettings']['Ports']:  # type: ignore
        self._container_port = int(port.split('/')[0])
        break

But the sandbox containers don't expose ports:

$ docker inspect fa3ff536487c | jq '.[0].NetworkSettings.Ports'
{}

But checking the /alive endpoint returns status: ok.
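To illustrate why the loop above leaves the port at 0, here is a minimal sketch that parses the same attrs structure and makes the "no ports" case explicit. The helper name first_exposed_port and the example port 30000 are assumptions for illustration only.

```python
from typing import Optional


def first_exposed_port(attrs: dict) -> Optional[int]:
    """Return the first container port listed in NetworkSettings.Ports, or None."""
    ports = attrs.get("NetworkSettings", {}).get("Ports") or {}
    for key in ports:
        # Keys look like "30000/tcp"; keep the numeric part.
        return int(key.split("/")[0])
    # Empty mapping: the loop body never runs, so there is no port to report.
    return None


print(first_exposed_port({"NetworkSettings": {"Ports": {}}}))  # None
print(first_exposed_port({"NetworkSettings": {"Ports": {"30000/tcp": None}}}))  # 30000
```

Returning None instead of silently keeping 0 lets the caller decide how to recover, rather than polling http://localhost:0.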

kripper avatar Jan 08 '25 16:01 kripper

Another problem is that the containers are not restarted. Doing "docker start" fixes the problem.
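A rough sketch of what an automatic "docker start" could look like when attaching. It assumes a container object shaped like the Docker SDK for Python's Container (a status attribute plus reload() and start()); ensure_running and FakeContainer are hypothetical names, and the fake exists only so the example runs without a Docker daemon.

```python
def ensure_running(container) -> bool:
    """Start `container` if it is stopped; return True if a start was issued."""
    container.reload()  # refresh cached attributes so `status` is current
    if container.status in ("exited", "created"):
        container.start()
        container.reload()
        return True
    return False


class FakeContainer:
    """Test double that mimics the three members used above."""

    def __init__(self, status):
        self.status = status

    def reload(self):
        pass  # a real container would re-fetch its state from the daemon

    def start(self):
        self.status = "running"


sandbox = FakeContainer("exited")
print(ensure_running(sandbox), sandbox.status)  # True running
print(ensure_running(sandbox), sandbox.status)  # False running
```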

kripper avatar Jan 08 '25 17:01 kripper

But the sandbox containers don't expose ports

I confirmed this issue can be fixed with https://github.com/All-Hands-AI/OpenHands/pull/6080

kripper avatar Jan 08 '25 17:01 kripper

@kripper can you explain how you reproduce this? I tried this:

  1. Run OpenHands
  2. Fill out settings
  3. Prompt "4 + 5"
  4. Ctrl + C terminal to kill OpenHands
  5. Run OpenHands
  6. Press "Jump back to recent conversation"
  7. Prompt "add 3"

[Screenshot 2025-01-08 at 2 15 54 PM]

mamoodi avatar Jan 08 '25 19:01 mamoodi

@kripper The stack trace you posted on https://github.com/All-Hands-AI/OpenHands/pull/6114 has a line that seems critical to me:

No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'

Can you confirm that:

  • The sessions directory exists in your test environment?
  • There are sessions within it?
  • They contain files that look something like this: [image]

tofarr avatar Jan 08 '25 19:01 tofarr

@tofarr

Ctrl + C terminal to kill OpenHands

I think this means there was no agent_state.pkl created, because it's saved when the controller closes normally. The /events are saved, of course. I'm not sure when metadata.json is saved.

We recently solved an issue where the pickle doesn't exist (it can be missing due to runtime errors, etc.); we should still be able to restore history: https://github.com/All-Hands-AI/OpenHands/pull/5946

I think conversation loading also should ideally not depend on the existence of this file?
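As a sketch of that idea, a loader could treat a missing agent_state.pkl as "no saved state" instead of an error. The function name load_agent_state and the example payload are hypothetical; this is not the code from the PR above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def load_agent_state(session_dir: Path) -> Optional[Any]:
    """Return the pickled state, or None if the session never saved one."""
    path = session_dir / "agent_state.pkl"
    try:
        with path.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Unclean exit: the controller never wrote the pickle.
        return None


with tempfile.TemporaryDirectory() as tmp:
    session_dir = Path(tmp)
    print(load_agent_state(session_dir))  # None
    (session_dir / "agent_state.pkl").write_bytes(pickle.dumps({"iteration": 3}))
    print(load_agent_state(session_dir))  # {'iteration': 3}
```

With a None result, the caller could fall back to rebuilding history from the saved events.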

enyst avatar Jan 08 '25 19:01 enyst

The previous issues were fixed by applying https://github.com/All-Hands-AI/OpenHands/pull/6114.

But one last issue remains, which prevents rejoining conversations after an unclean exit or reboot of the box (I usually restart the OH container using Docker Desktop).

It's not critical, but it's worth reporting here for mental health.

I can reproduce this bug consistently.

It happens only the first time I execute make run after a forced (unclean) reboot of the box. When I execute make run the second time, it works fine (and so on). I compared /tmp files and running processes before and after the first make run, and there was nothing suspicious there. Maybe OH creates a lock file or similar that must be cleaned up on exit.

If I interrupt make run using CTRL+C (clean exit), reboot, and run make run again, I can join conversations without problems.

Thus, this issue occurs only after OH was terminated uncleanly.

This is the stack trace generated when trying to join a conversation after the first make run:

21:31:01 - openhands:INFO: docker_runtime.py:147 - [runtime aa4cca868229460fb319c227be9c65db] Waiting for client to become ready at http://localhost:0...
21:31:01 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
Traceback (most recent call last):
  File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
    await self.runtime.connect()
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
    await call_sync_from_async(self._wait_until_alive)
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
    raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.INIT to AgentState.FINISHED
21:31:02 - openhands:ERROR: manager.py:209 - Error connecting to conversation aa4cca868229460fb319c227be9c65db: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.

kripper avatar Jan 08 '25 19:01 kripper

Is this still reproducible with 0.21?

mamoodi avatar Jan 22 '25 19:01 mamoodi

Yes, it's still not working. When restarting OH and trying to rejoin a conversation, it doesn't get the correct port:

requests.exceptions.InvalidURL: Failed to parse: http://localhost:-1/alive

There was a PR that fixes this.
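Independently of that PR, a small guard would surface the bad port earlier than the InvalidURL from requests. This is only a sketch; alive_url is a hypothetical helper, and the example port is arbitrary.

```python
def alive_url(port: int, host: str = "localhost") -> str:
    """Build the /alive URL, rejecting ports outside the valid TCP range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid runtime port: {port}")
    return f"http://{host}:{port}/alive"


print(alive_url(30000))  # http://localhost:30000/alive

try:
    alive_url(-1)
except ValueError as err:
    print(err)  # invalid runtime port: -1
```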

kripper avatar Jan 22 '25 21:01 kripper

Also, OH is not starting the sandboxes.

kripper avatar Jan 22 '25 21:01 kripper

There was a PR that fixes this.

See https://github.com/All-Hands-AI/OpenHands/pull/6080

kripper avatar Jan 22 '25 21:01 kripper

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Feb 22 '25 01:02 github-actions[bot]

This issue was closed because it has been stalled for over 30 days with no activity.

github-actions[bot] avatar Mar 01 '25 02:03 github-actions[bot]