Worker processes dying silently after prewarm, causing all job requests to fail

Open ClvHbl opened this issue 3 weeks ago • 3 comments

Bug Description

Our agent had been running stably in production for weeks, then suddenly all job requests started failing with DuplexClosed errors. Redeploying the exact same code without any changes seems to have resolved the issue; job requests are now succeeding.

Error sequence from logs:

worker process is not responding.. worker crashed?
worker process is not responding.. worker crashed?
worker process is not responding.. worker crashed?
worker process is not responding.. worker crashed?
worker process is not responding.. worker crashed?

{"message": "unexpected message type: 258", "level": "WARNING", "name": "livekit.agents", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:15:17.155569+00:00"}

{"message": "Running <Task finished name='Task-66' coro=<Worker._run_ws.._recv_task() done, defined at /usr/local/lib/python3.11/site-packages/livekit/agents/worker.py:730> exception=Exception('worker connection closed unexpectedly')> took too long: 127.05 seconds", "level": "WARNING", "name": "livekit.agents", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.210177+00:00"}

{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=6889)", "pid": 6889, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.239745+00:00"}

{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=3205)", "pid": 3205, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.241383+00:00"}

{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=4669)", "pid": 4669, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.242096+00:00"}

{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=8789)", "pid": 8789, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.242697+00:00"}

{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=138)", "pid": 138, "inference": true, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.243402+00:00"}

{"message": "failed to connect to livekit, retrying in 0s", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 689, in _connection_task\n await self._run_ws(ws)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 770, in _run_ws\n await asyncio.gather(*tasks)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 742, in _recv_task\n raise Exception("worker connection closed unexpectedly")\nException: worker connection closed unexpectedly", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.244395+00:00"}

{"message": "registered worker", "level": "INFO", "name": "livekit.agents", "id": "CAW_QVAqNAHneLpW", "url": "https://mirror-prod-frs6nvwg.livekit.cloud", "region": "US East B", "protocol": 16, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-08T17:17:24.291772+00:00"}

{"message": "failed to connect to livekit, retrying in 0s", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 689, in _connection_task\n await self._run_ws(ws)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 770, in _run_ws\n await asyncio.gather(*tasks)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 742, in _recv_task\n raise Exception("worker connection closed unexpectedly")\nException: worker connection closed unexpectedly", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T20:30:09.935352+00:00"}

{"message": "registered worker", "level": "INFO", "name": "livekit.agents", "id": "CAW_QVAqNAHneLpW", "url": "https://mirror-prod-frs6nvwg.livekit.cloud", "region": "US East B", "protocol": 16, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-08T20:30:10.235292+00:00"}

{"message": "received job request", "level": "INFO", "name": "livekit.agents", "job_id": "AJ_hYMchA3iAzbr", "dispatch_id": "AD_DBwZPPrSMgk6", "room_name": "agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762730991665_kjhky1n", "agent_name": "", "resuming": false, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:52.987779+00:00"}

{"message": "job_request_fnc failed", "level": "ERROR", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/utils/aio/duplex_unix.py", line 50, in send_bytes\n await self._writer.drain()\n File "/usr/local/lib/python3.11/asyncio/streams.py", line 392, in drain\n await self._protocol._drain_helper()\n File "/usr/local/lib/python3.11/asyncio/streams.py", line 166, in _drain_helper\n raise ConnectionResetError('Connection lost')\nConnectionResetError: Connection lost\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 886, in _job_request_task\n await self._opts.request_fnc(job_req)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 69, in _default_request_fnc\n await ctx.accept()\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/job.py", line 548, in accept\n await self._on_accept(accept_arguments)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 868, in _on_accept\n await self._proc_pool.launch_job(running_info)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/proc_pool.py", line 113, in launch_job\n await proc.launch_job(info)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/job_proc_executor.py", line 152, in launch_job\n await channel.asend_message(self._pch, start_req)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/channel.py", line 51, in asend_message\n await dplx.send_bytes(_write_message(msg))\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/utils/aio/duplex_unix.py", line 52, in send_bytes\n raise DuplexClosed() from e\nlivekit.agents.utils.aio.duplex_unix.DuplexClosed", "job_request": "<livekit.agents.job.JobRequest object at 0x7fea54b6c8d0>", "agent_name": "", "service": "livekit-agent", "severity": "ERROR", "timestamp": "2025-11-09T23:29:53.037891+00:00"}

{"message": "initializing process", "level": "INFO", "name": "livekit.agents", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.056586+00:00"}

{"message": "▶ Worker process 9810 starting prewarm", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.318137+00:00"}

{"message": "🔥 Prewarming inference executor...", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.318290+00:00"}

{"message": "process initialized", "level": "INFO", "name": "livekit.agents", "pid": 9810, "elapsed_time": 0.32, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.373195+00:00"}

{"message": "VAD model loaded and cached", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.372681+00:00"}

{"message": "✅ Worker process 9810 prewarm complete", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.372799+00:00"}

Expected Behavior

Prewarmed worker processes should remain alive and ready to handle incoming job requests. When a job arrives, the main process should successfully dispatch it to an idle prewarmed worker without connection errors.

Reproduction Steps

I am unable to reproduce it myself; the agent had been running on LiveKit Cloud for weeks without any trouble.

Deployment:
  - Deploy to LiveKit Cloud using lk agent deploy via Cloud Build
  - Agent configured with 4 idle processes (num_idle_processes=4)
  - Memory limits: 500MB warn, 1GB limit per process. With this config the startup logs show "custom load_threshold is not supported when hosting on Cloud, reverting to default"; do you think this could cause the issue, and what am I supposed to put there instead? (Roughly how these options are wired up is sketched below.)
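
Simplified sketch of how main.py wires this up. This is not a verbatim copy of our code, and the job_memory_warn_mb / job_memory_limit_mb / load_threshold keyword names are my best guess at the WorkerOptions fields rather than something copied from the docs:

  # Simplified sketch of the worker setup (not verbatim; keyword names for the
  # memory/load options are assumptions about the WorkerOptions API).
  from livekit.agents import WorkerOptions, cli

  if __name__ == "__main__":
      cli.run_app(
          WorkerOptions(
              entrypoint_fnc=entrypoint,   # entrypoint described below
              prewarm_fnc=prewarm,         # prewarm function shown below
              num_idle_processes=4,        # keep 4 prewarmed idle workers
              load_threshold=0.7,          # from LIVEKIT_LOAD_THRESHOLD; triggers the Cloud warning above
              job_memory_warn_mb=500,      # warn at ~500 MB per job process
              job_memory_limit_mb=1000,    # hard limit at ~1 GB per job process
          )
      )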

Prewarm function:
  def prewarm(proc: JobProcess):
      setup_eou_logging()  # Logging setup
      proc.userdata["vad"] = load_vad_model_singleton()  # Load Silero VAD model
      proc.userdata["google_plugin_available"] = GOOGLE_PLUGIN_AVAILABLE

Entrypoint function:
  - Parses room name for agent_id/session_id
  - Initializes Redis logging
  - Sets up event monitoring
  - Registers room event handlers
  - Runs the agent session (rough skeleton below)
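
A rough skeleton of that entrypoint. The helper functions (parse_room_name, setup_redis_logging, setup_event_monitoring) are our own code, and the AgentSession wiring is simplified, so treat this as a sketch rather than the exact implementation:

  # Rough skeleton of the entrypoint (simplified; helpers are our own code).
  import logging

  from livekit.agents import Agent, AgentSession, JobContext

  logger = logging.getLogger("livekit-agent")

  async def entrypoint(ctx: JobContext):
      await ctx.connect()

      # 1. Parse agent_id / session_id from the room name
      agent_id, session_id = parse_room_name(ctx.room.name)  # our helper

      # 2. Per-job Redis logging + event monitoring
      setup_redis_logging(job_id=ctx.job.id, room_id=ctx.room.name, session_id=session_id)
      setup_event_monitoring(ctx)  # our helper

      # 3. Room event handlers
      @ctx.room.on("participant_disconnected")
      def _on_participant_disconnected(participant):
          logger.info(f"participant left: {participant.identity}")

      # 4. Run the agent session, reusing the VAD loaded in prewarm
      session = AgentSession(vad=ctx.proc.userdata["vad"])
      await session.start(agent=Agent(instructions="..."), room=ctx.room)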

Logging setup:

Axiom logging configured at module level (main process only, inherited by workers):
import logging
import multiprocessing as mp
import axiom_py
from axiom_py.logging import AxiomHandler

root_logger = logging.getLogger()

if mp.current_process().name == 'MainProcess':
  axiom_client = axiom_py.Client(axiom_token)  # axiom_token / axiom_dataset come from env
  axiom_handler = AxiomHandler(axiom_client, axiom_dataset)
  root_logger.addHandler(axiom_handler)
else:
  # Child process - handler will be inherited via fork
  root_logger.setLevel(logging.INFO)

Redis logging initialized in entrypoint when job starts:
async def entrypoint(ctx: JobContext):
  # ... parse room name, initialize monitoring ...
  setup_redis_logging(job_id=job_id, room_id=room_id, session_id=session_id)
  logger.info(f"✅ Redis log handler initialized for job {job_id}, session {session_id}")

Operating System

LiveKit Cloud managed infrastructure; I'm joining from a MacBook Pro in Chrome.

Models Used

I don't think it's relevant here, but: Nova-3, GPT-4.1, and Google Chirp 3 HD.

Package Versions

Backend: 

Python: 3.11
livekit: 1.0.12
livekit-agents: 1.2.14
livekit-api: 1.0.5
livekit-protocol: 1.0.4
livekit-plugins-silero: 1.2.14 (VAD)
livekit-plugins-openai: 1.2.14
livekit-plugins-anthropic: 1.2.14
livekit-plugins-deepgram: 1.2.14
livekit-plugins-cartesia: 1.2.14
livekit-plugins-elevenlabs: 1.2.14
livekit-plugins-google: 1.2.14

Client SDK:

livekit-client: 2.15.14 (React/TypeScript frontend)

Session/Room/Call IDs

Here are examples of rooms that failed. What is interesting is that, in contrast to the room that succeeded, no agent seems to appear in them:

Failed:

  • Room: agent_7f532e0b-66ef-448d-885e-4e5152dbb204_session_session_1762731129067_3d3tf2z

  • Room session ID: RM_Kyc92XrgWN6g

  • Room: agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762731096451_3wjnd7n

  • Room session ID: RM_7RcfQg4jyzVn

Succeeded (after redeployment):

  • Room: agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762732720714_5bo8yka
  • Room session ID: RM_4AFHQUrDxdZY

Proposed Solution


Additional Context

Here is my Dockerfile:

  FROM python:3.11-slim

  WORKDIR /app

  COPY requirements.txt .
  RUN pip install --no-cache-dir torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu || \
      pip install --no-cache-dir torch==2.3.1
  RUN pip install --no-cache-dir --upgrade -r requirements.txt

  COPY . .
  RUN python main.py download-files

  EXPOSE 8081

  ENV LIVEKIT_NUM_PROCESSES=4
  ENV LIVEKIT_LOAD_THRESHOLD=0.7
  ENV LIVEKIT_MEMORY_WARN_MB=1000
  ENV LIVEKIT_MEMORY_LIMIT_MB=2000

  CMD ["python", "main.py", "start"]

Could these ENV values be an issue? I am getting warnings about them when the service starts up; what should I put there instead?

ClvHbl · Nov 10 '25 00:11