Refactor LLMProxy to run Uvicorn in isolated process

Open · beanie00 opened this issue 2 months ago · 1 comment

Summary

This PR refactors the LLMProxy.start() logic to launch the Uvicorn proxy server in a fully isolated process, created with the multiprocessing "spawn" start method. The previous implementation ran the server in a background thread, which caused persistent connection and transport errors when Ray forked workers or when LiteLLM reused existing event loops.

Motivation

In the previous design, the proxy shared its asyncio loop and open network sockets with the parent process. When Ray forked new workers or reused existing aiohttp.ClientSession objects, the same TCP socket descriptors were inherited in an invalid state. This led to repeated I/O failures like:

aiohttp.client_exceptions.ClientConnectionResetError: Cannot write to closing transport
openai.APIConnectionError: Connection error.
litellm.llms.openai.common_utils.OpenAIError: Connection error.

These errors occurred because aiohttp tried to write to a “closing transport”, a socket that had already been closed in the parent process or invalidated by fork().

Changes

  • ✅ Run the LLMProxy Uvicorn server in a dedicated subprocess (spawn mode) to ensure a clean event loop and socket set.
  • ✅ Replace the thread-based readiness logic with a socket-level health check.
  • ✅ Cleanly isolate OpenTelemetry and aiohttp sessions between parent and child processes.
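
As a rough illustration of the first two changes, the following is a minimal sketch and not the actual LLMProxy code. The names serve, wait_for_port, and start_isolated are hypothetical, and a standard-library HTTPServer stands in for the Uvicorn server so the example stays self-contained. It assumes the readiness check is a plain TCP connect attempt repeated until a deadline.

```python
import multiprocessing
import socket
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Socket-level health check: True once a TCP connect succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the child has bound and is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval)
    return False


def serve(host: str, port: int) -> None:
    """Hypothetical child entry point; the real proxy would start Uvicorn here."""
    HTTPServer((host, port), BaseHTTPRequestHandler).serve_forever()


def start_isolated(host: str, port: int) -> multiprocessing.Process:
    """Start the server in a spawned process and block until its port accepts connections."""
    # "spawn" starts a fresh interpreter, so the child does not inherit the
    # parent's event loop or its open socket descriptors, unlike fork().
    ctx = multiprocessing.get_context("spawn")
    proc = ctx.Process(target=serve, args=(host, port), daemon=True)
    proc.start()
    if not wait_for_port(host, port):
        proc.terminate()
        proc.join()
        raise RuntimeError(f"server on {host}:{port} did not become ready")
    return proc
```

Because spawn re-imports the main module in the child, a call such as start_isolated("127.0.0.1", 8765) needs to sit under an `if __name__ == "__main__":` guard when run from a script, and the target function must be importable at module level. The port number here is only an example.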

beanie00 · Oct 31 '25 13:10