Fix closing sessions
End-user friendly description of the problem this fixes or functionality that this introduces
This PR improves the handling of multiple conversations and session management in OpenHands. It ensures that user workspaces are preserved even after disconnections or server restarts, and implements a smart session management system that automatically handles conversation limits.
- [x] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Improved multi-conversation support with automatic session management and workspace preservation. Users can now maintain multiple conversations across different tabs while ensuring their work is preserved, even after disconnections or server restarts.
Summary of Changes
- Added user_id tracking to sessions for better user-specific resource management
- Implemented proper closing of stale sessions to prevent resource leaks
- Added "agent stopped" event emission for better frontend state management
- Enhanced recovery mechanism to preserve workspace/files after disconnection
- Added smart session management for handling multiple conversations
Acceptance Criteria for Multi-conversation Runtime Management
Recovery
- Start a conversation
- Disconnect
- Restart the server
- Verify workspace/files are preserved
Conversation Limits
- Start 4 conversations in different tabs
- First conversation goes to "agent stopped"
- Sending a new message starts it back up, and another conversation goes to "agent stopped"
- Verify workspace is totally recovered
Testing Instructions
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:6f677aa-nikolaik --name openhands-app-6f677aa docker.all-hands.dev/all-hands-ai/openhands:6f677aa
-p 3000:3000 \
-v /var/run/docker.sock:/var/run/docker.sock \
--add-host host.docker.internal:host-gateway \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b2a0de2-nikolaik \
--name openhands-app-b2a0de2 \
docker.all-hands.dev/all-hands-ai/openhands:b2a0de2
Does this solve https://github.com/All-Hands-AI/OpenHands/issues/6148 ?
Does this solve #6148 ?
This works in my tests
This works in my tests
Ok, I'm testing locally right now.
This is going to be more fun, thank you!
I didn't have time to test it (my bad, I'm always a bit concerned that these saas focused changes may break local), but I assume you test locally too 😅
Local testing will always be a priority and is tested before anything even hits SAAS. (My standard procedure is to test without clustering, then test with clustering, then test in SAAS)
I tested this:
- I created a conversation/sandbox (worked fine).
- I restarted the OH server
- I joined the conversation URL
It failed, because the container couldn't be started.
Some remarks:
http://localhost:0<-- it didn't get the port. See the fix here: https://github.com/All-Hands-AI/OpenHands/issues/6148- This file was missing:
No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'
Logs:
18:55:55 - openhands:INFO: docker_runtime.py:147 - [runtime e770430539174979bf2296e8c6d3fde5] Waiting for client to become ready at http://localhost:0...
18:55:55 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
Traceback (most recent call last):
File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
await self.runtime.connect()
File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
await call_sync_from_async(self._wait_until_alive)
File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
result = await coro
^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
return copy(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
result = action(retry_state)
^^^^^^^^^^^^^^^^^^^
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
18:55:55 - openhands:WARNING: agent_session.py:295 - State could not be restored: [Errno 2] No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
18:55:55 - openhands:INFO: agent_controller.py:388 - [Agent Controller e770430539174979bf2296e8c6d3fde5] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO: 127.0.0.1:39866 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/vscode-url HTTP/1.1" 404 Not Found
18:55:55 - openhands:ERROR: manager.py:209 - Error connecting to conversation e770430539174979bf2296e8c6d3fde5: Container openhands-runtime-e770430539174979bf2296e8c6d3fde5 has exited.
INFO: 127.0.0.1:39854 - "GET /api/conversations/e770430539174979bf2296e8c6d3fde5/list-files HTTP/1.1" 404 Not Found
@kripper - I think your issue may actually be unrelated (And sounds like a config issue), so I'm gonna respond on the ticket you opened
I can reproduce consistently:
It happens only the first time I execute "make run" after rebooting the box. When I execute "make run" the second time it works fine (and so on).
Maybe something in /tmp?
/tmp is erased after reboot.
Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive
I confirm this PR works fine and that https://github.com/All-Hands-AI/OpenHands/issues/6148#issuecomment-2578502601 is unrelated.
Should we set close_delay to something much smaller, now that we're checking if the agent is running? Might as well be more aggressive
I've reduced the default here to 15 seconds
Comments here are non-blocking.
Let's just make sure the latest commit has been thoroughly tested (especially in a multi-replica mode, and especially with multiple users connected) before merging
Looks like this cause /download endpoint to stop working
e.g., This commit adds new testcases that relies on copy_from method that uses /download.
CI error: https://github.com/All-Hands-AI/OpenHands/actions/runs/12793391296/job/35666481217
I just locally reproduced this, it seems after reverting this PR, the test started to work. With this commits, the zip downloaded from runtime has a size of 0B.
@tofarr https://github.com/All-Hands-AI/OpenHands/issues/6382