
OSError: [Errno 24] Too many open files

Open conciseben opened this issue 4 months ago • 0 comments

OSError: [Errno 24] Too many open files

[2025-08-15 18:45:28,710] ERROR in app: Exception on /poll [POST]

Traceback (most recent call last):

 File "/opt/venv/lib/python3.12/site-packages/flask/app.py", line 1473, in wsgi_app

   response = self.full_dispatch_request()

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/flask/app.py", line 882, in full_dispatch_request

   rv = self.handle_user_exception(e)

        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/flask/app.py", line 880, in full_dispatch_request

   rv = self.dispatch_request()

        ^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/flask/app.py", line 865, in dispatch_request

   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]

          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/asgiref/sync.py", line 255, in __call__

   loop_future.result()

 File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result

   return self.__get_result()

          ^^^^^^^^^^^^^^^^^^^

 File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result

   raise self._exception

 File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run

   result = self.fn(*self.args, **self.kwargs)

            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/nest_asyncio.py", line 26, in run

   loop = asyncio.get_event_loop()

          ^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/nest_asyncio.py", line 40, in _get_event_loop

   loop = events.get_event_loop_policy().get_event_loop()

          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/opt/venv/lib/python3.12/site-packages/nest_asyncio.py", line 66, in get_event_loop

   loop = self.new_event_loop()

          ^^^^^^^^^^^^^^^^^^^^^

 File "/usr/lib/python3.12/asyncio/events.py", line 720, in new_event_loop

   return self._loop_factory()

          ^^^^^^^^^^^^^^^^^^^^

 File "/usr/lib/python3.12/asyncio/unix_events.py", line 64, in __init__

   super().__init__(selector)

 File "/usr/lib/python3.12/asyncio/selector_events.py", line 63, in __init__

   selector = selectors.DefaultSelector()

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/usr/lib/python3.12/selectors.py", line 349, in __init__

   self._selector = self._selector_cls()

                    ^^^^^^^^^^^^^^^^^^^^

OSError: [Errno 24] Too many open files

Error on request:

Traceback (most recent call last):

 File "/opt/venv/lib/python3.12/site-packages/werkzeug/serving.py", line 370, in run_wsgi

   execute(self.server.app)

 File "/opt/venv/lib/python3.12/site-packages/werkzeug/serving.py", line 346, in execute

   selector = selectors.DefaultSelector()

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 File "/usr/lib/python3.12/selectors.py", line 349, in __init__

   self._selector = self._selector_cls()

                    ^^^^^^^^^^^^^^^^^^^^

OSError: [Errno 24] Too many open files

The error OSError: [Errno 24] Too many open files indicates that the process has reached the operating system's limit on the number of simultaneously open file descriptors. This is a common issue in applications that don't properly close resources like network sockets, files, or pipes.
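To check how close a process is to that limit, the following minimal sketch compares the number of open descriptors with the soft limit. It assumes a Linux host where /proc/self/fd is available; the helper name fd_usage is hypothetical and not part of agent-zero.

```python
import os
import resource


def fd_usage() -> tuple[int, int]:
    """Return (open descriptor count, soft RLIMIT_NOFILE) for this process.

    Assumption: Linux, where /proc/self/fd lists one entry per open descriptor.
    The listing itself briefly uses one descriptor, so the count is approximate.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_count = len(os.listdir("/proc/self/fd"))
    return open_count, soft_limit


if __name__ == "__main__":
    open_count, soft_limit = fd_usage()
    print(f"{open_count} descriptors open, soft limit is {soft_limit}")
```

Logging this value periodically in a long-running process shows whether the count keeps climbing. Raising the limit only postpones the error if descriptors are leaking.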

The stack trace shows the error originating from selectors.DefaultSelector(), which is used by Python's asyncio to manage I/O operations. This suggests that the application is repeatedly creating new network connections or file handles without closing the old ones.
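Event loops themselves hold descriptors, which is why the failure surfaces in selectors.DefaultSelector(). The sketch below, assuming Linux and the default selector-based event loop, measures the descriptor count before creating a loop, while it exists, and after closing it. The helper names are hypothetical.

```python
import asyncio
import os


def count_fds() -> int:
    # Assumption: Linux, where /proc/self/fd has one entry per open descriptor.
    return len(os.listdir("/proc/self/fd"))


def measure_loop_cost() -> tuple[int, int, int]:
    """Return descriptor counts before, during, and after an event loop's life."""
    before = count_fds()
    loop = asyncio.new_event_loop()  # allocates a selector and a wake-up socket pair
    with_loop = count_fds()
    loop.close()  # releases the descriptors the loop owned
    after = count_fds()
    return before, with_loop, after


if __name__ == "__main__":
    before, with_loop, after = measure_loop_cost()
    print(f"before={before} with_loop={with_loop} after_close={after}")
```

A loop that is created but never closed keeps its descriptors until it is garbage collected, so code paths that create new loops repeatedly are worth checking alongside the SSH sessions.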

The Root Cause 🔎

Based on the provided stack trace, the likely cause is the repeated creation of SSHInteractiveSession objects without them being properly closed. While the previous fix addressed a race condition, it didn't fully resolve the underlying resource leak.

Specifically, the terminal_session method in the code you provided creates a new SSHInteractiveSession object whenever the existing one is not connected or if a reset is requested. However, it doesn't explicitly close the old one before creating the new one. While the prepare_state method's logic does attempt to close sessions, the flow in terminal_session can bypass this or the garbage collector might not close the file descriptors fast enough, leading to a build-up.

Every time a new SSHInteractiveSession is created, it opens a new network socket and other associated file descriptors. If this happens frequently in a long-running process, the number of open files will eventually exceed the system limit, causing the OSError.
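One way to check this hypothesis is to see which kinds of descriptors accumulate. This is a rough diagnostic sketch, assuming Linux; describe_open_fds is a hypothetical helper that groups the entries of /proc/self/fd by what they point to.

```python
import os
import socket
from collections import Counter


def describe_open_fds() -> Counter:
    """Count this process's open descriptors by kind (Linux only)."""
    kinds: Counter = Counter()
    fd_dir = "/proc/self/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds[target] += 1  # e.g. anon_inode:[eventpoll] for epoll selectors
        else:
            kinds["file"] += 1
    return kinds


if __name__ == "__main__":
    leaked = [socket.socket() for _ in range(3)]  # simulate unclosed connections
    print(describe_open_fds())
    for sock in leaked:
        sock.close()
```

If the socket count grows with every reconnect, unclosed sessions are the likely source. A growing anon_inode:[eventpoll] count points at selectors or event loops instead.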

Proposed Solution 🛠️

To fix this, we need to ensure that every SSHInteractiveSession is explicitly closed when it's no longer needed. The key is to make the session management more robust, ensuring a clean state before creating a new session.

Here is the corrected terminal_session method, which is intended to prevent the "Too many open files" error:

    async def terminal_session(
        self, session: int, command: str, reset: bool = False, prefix: str = ""
    ):
        await self.agent.handle_intervention()

        # Check if we need to establish a new session
        if reset or session not in self.state.shells or not self.state.shells[session].is_connected():
            
            # Explicitly close the old session if it exists. pop() removes the
            # entry even if close() raises, so a broken shell is never reused.
            old_shell = self.state.shells.pop(session, None)
            if old_shell is not None:
                try:
                    old_shell.close()
                except Exception as e:
                    PrintStyle.warning(f"Failed to close existing session {session}: {e}")

            # Create a new session
            if self.agent.config.code_exec_ssh_enabled:
                pswd = (
                    self.agent.config.code_exec_ssh_pass
                    if self.agent.config.code_exec_ssh_pass
                    else await rfc_exchange.get_root_password()
                )
                shell = SSHInteractiveSession(
                    self.agent.context.log,
                    self.agent.config.code_exec_ssh_addr,
                    self.agent.config.code_exec_ssh_port,
                    self.agent.config.code_exec_ssh_user,
                    pswd,
                )
            else:
                shell = LocalInteractiveSession()
            
            self.state.shells[session] = shell
            await shell.connect()
        
        try:
            self.state.shells[session].send_command(command)

            PrintStyle(
                background_color="white", font_color="#1B4F72", bold=True
            ).print(f"{self.agent.agent_name} code execution output")
            return await self.get_terminal_output(session=session, prefix=prefix)

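        except Exception:
            # If an error occurs, close the session and drop it from the state
            broken_shell = self.state.shells.pop(session, None)
            if broken_shell is not None:
                try:
                    broken_shell.close()
                except Exception as close_error:
                    PrintStyle.warning(f"Failed to close session {session}: {close_error}")
            # Bare raise re-raises the original error with its traceback intact
            raise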

Explanation of the Fix ✨

  • Explicit Session Closing: The most critical change is the addition of a try...except block with an explicit self.state.shells[session].close() call. This ensures that even if an exception occurs during the execution of a command, the underlying socket is closed and its file descriptor is released back to the operating system.
  • Robust State Management: Before creating a new shell, we now explicitly check if a shell for the given session already exists. If it does, we try to close it and remove it from the shells dictionary. This prevents orphaned sessions and resource leaks.
  • Simplified Logic: The logic for session creation and command execution is now more linear and easier to follow. It first ensures a valid session is ready, then executes the command, and finally handles any cleanup in case of an error.
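The close-before-replace pattern described above can also be isolated in a small helper. The following is a minimal sketch, not agent-zero code: SessionRegistry and FakeSession are hypothetical names, and the only assumption about a session is that it exposes a synchronous close() method.

```python
from typing import Callable, Dict, Protocol


class Closable(Protocol):
    def close(self) -> None: ...


class SessionRegistry:
    """Keeps at most one live session per key and closes whatever it replaces."""

    def __init__(self) -> None:
        self._sessions: Dict[int, Closable] = {}

    def discard(self, key: int) -> None:
        # pop() first, so the entry is gone even if close() raises
        old = self._sessions.pop(key, None)
        if old is not None:
            try:
                old.close()
            except Exception as exc:
                print(f"Failed to close session {key}: {exc}")

    def replace(self, key: int, factory: Callable[[], Closable]) -> Closable:
        self.discard(key)  # release the old descriptors before opening new ones
        session = factory()
        self._sessions[key] = session
        return session

    def close_all(self) -> None:
        for key in list(self._sessions):
            self.discard(key)


class FakeSession:
    """Hypothetical stand-in for an SSH or local shell session."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


if __name__ == "__main__":
    registry = SessionRegistry()
    first = registry.replace(0, FakeSession)
    second = registry.replace(0, FakeSession)
    print(first.closed, second.closed)
    registry.close_all()
```

Routing every creation through replace() leaves no code path that can overwrite a live session without closing it, and close_all() gives a single place to release everything on shutdown.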

These changes make the code more resilient to both network connection issues and resource leaks, which directly addresses the "Too many open files" error.

conciseben, Aug 15 '25 19:08