OpenHands icon indicating copy to clipboard operation
OpenHands copied to clipboard

Fix closing sessions

Open tofarr opened this issue 11 months ago • 10 comments

End-user friendly description of the problem this fixes or functionality that this introduces

This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.

  • [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.


Summary of Changes

  • Added user_id tracking to sessions for better user-specific resource management
  • Implemented proper closing of stale sessions to prevent resource leaks
  • Added "agent stopped" event emission for better frontend state management
  • Enhanced recovery mechanism to preserve workspace/files after disconnection
  • Added smart session management for handling multiple conversations

Acceptance Criteria for Multi-conversation Runtime Management

Recovery

  • Start a conversation
  • Disconnect
  • Restart the server
  • Verify workspace/files are preserved

Conversation Limits

  • Start 4 conversations in different tabs
  • First conversation goes to "agent stopped"
  • Sending a new message starts it back up, and another conversation goes to "agent stopped"
  • Verify workspace is totally recovered

Testing Instructions

To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:6f677aa-nikolaik   --name openhands-app-6f677aa   docker.all-hands.dev/all-hands-ai/openhands:6f677aa
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2a0de2-nikolaik \
  --name openhands-app-b2a0de2 \
  docker.all-hands.dev/all-hands-ai/openhands:b2a0de2

tofarr avatar Jan 07 '25 15:01 tofarr

Does this solve https://github.com/All-Hands-AI/OpenHands/issues/6148 ?

kripper avatar Jan 08 '25 17:01 kripper

Does this solve #6148 ?

This works in my tests

tofarr avatar Jan 08 '25 18:01 tofarr

This works in my tests

Ok, I'm testing locally right now.

kripper avatar Jan 08 '25 18:01 kripper

This is going to be more fun, thank you!

I didn't have time to test it (my bad, I'm always a bit concerned that these saas focused changes may break local), but I assume you test locally too 😅

Local testing will always be a priority and is tested before anything even hits SAAS. (My standard procedure is to test without clustering, then test with clustering, then test in SAAS)

tofarr avatar Jan 08 '25 18:01 tofarr

I tested this:

  • I created a conversation/sandbox (worked fine).
  • I restarted the OH server
  • I joined the conversation URL

It failed, because the container couldn't be started.

Some remarks:

  • http://localhost:0 <-- it didn't get the port. See the fix here: https://github.com/All-Hands-AI/OpenHands/issues/6148
  • This file was missing: No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'

Logs:

18:55:55 - openhands:INFO: docker_runtime.py:147 - [runtime e770430539174979bf2296e8c6d3fde5] Waiting for client to become ready at http://localhost:0...
18:55:55 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
Traceback (most recent call last):
  File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
    await self.runtime.connect()
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
    await call_sync_from_async(self._wait_until_alive)
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
    raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
18:55:55 - openhands:WARNING: agent_session.py:295 - State could not be restored: [Errno 2] No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO:     127.0.0.1:39866 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/vscode-url HTTP/1.1" 404 Not Found
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO:     127.0.0.1:39854 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/list-files HTTP/1.1" 404 Not Found

kripper avatar Jan 08 '25 19:01 kripper

@kripper - I think your issue may actually be unrelated (And sounds like a config issue), so I'm gonna respond on the ticket you opened

tofarr avatar Jan 08 '25 19:01 tofarr

I can reproduce consistently:

It happens only the first time I execute "make run" after rebooting the box. When I execute "make run" the second time it works fine (and so on).

Maybe something in /tmp? /tmp is erased after reboot.

kripper avatar Jan 08 '25 19:01 kripper

Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive

rbren avatar Jan 08 '25 19:01 rbren

I confirm this PR works fine and that https://github.com/All-Hands-AI/OpenHands/issues/6148#issuecomment-2578502601 is unrelated.

kripper avatar Jan 08 '25 21:01 kripper

Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive

I've reduced the default here to 15 seconds

tofarr avatar Jan 09 '25 23:01 tofarr

Comments here are non-blocking.

Let's just make sure the latest commit has been thoroughly tested (especially in a multi-replica mode, and especially with multiple users connected) before merging

rbren avatar Jan 15 '25 16:01 rbren

Looks like this cause /download endpoint to stop working

e.g., This commit adds new testcases that relies on copy_from method that uses /download.

CI error: https://github.com/All-Hands-AI/OpenHands/actions/runs/12793391296/job/35666481217

I just locally reproduced this, it seems after reverting this PR, the test started to work. With this commits, the zip downloaded from runtime has a size of 0B.

xingyaoww avatar Jan 15 '25 22:01 xingyaoww

@tofarr https://github.com/All-Hands-AI/OpenHands/issues/6382

kripper avatar Jan 21 '25 09:01 kripper