agents
agents copied to clipboard
Worker processes dying silently after prewarm, causing all job requests to fail
Bug Description
Our agent was running stable in production for weeks, then suddenly all job requests started failing with DuplexClosed errors. Redeploying the exact same code without any changes seems to have resolved the issue, job requests are now succeeding.
Error sequence from logs:
worker process is not responding.. worker crashed? worker process is not responding.. worker crashed? worker process is not responding.. worker crashed? worker process is not responding.. worker crashed? worker process is not responding.. worker crashed?
{"message": "unexpected message type: 258", "level": "WARNING", "name": "livekit.agents", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:15:17.155569+00:00"}
{"message": "Running <Task finished name='Task-66' coro=<Worker._run_ws.
{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=6889)", "pid": 6889, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.239745+00:00"}
{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=3205)", "pid": 3205, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.241383+00:00"}
{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=4669)", "pid": 4669, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.242096+00:00"}
{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=8789)", "pid": 8789, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.242697+00:00"}
{"message": "Failed to get memory info for process", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/supervised_proc.py", line 361, in _memory_monitor_task\n process = psutil.Process(self._pid)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 313, in init\n self._init(pid)\n File "/usr/local/lib/python3.11/site-packages/psutil/init.py", line 359, in _init\n raise NoSuchProcess(pid, msg=msg) from None\npsutil.NoSuchProcess: process PID not found (pid=138)", "pid": 138, "inference": true, "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.243402+00:00"}
{"message": "failed to connect to livekit, retrying in 0s", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 689, in _connection_task\n await self._run_ws(ws)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 770, in _run_ws\n await asyncio.gather(*tasks)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 742, in _recv_task\n raise Exception("worker connection closed unexpectedly")\nException: worker connection closed unexpectedly", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T17:17:24.244395+00:00"}
{"message": "registered worker", "level": "INFO", "name": "livekit.agents", "id": "CAW_QVAqNAHneLpW", "url": "https://mirror-prod-frs6nvwg.livekit.cloud", "region": "US East B", "protocol": 16, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-08T17:17:24.291772+00:00"}
{"message": "failed to connect to livekit, retrying in 0s", "level": "WARNING", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 689, in _connection_task\n await self._run_ws(ws)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 770, in _run_ws\n await asyncio.gather(*tasks)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 742, in _recv_task\n raise Exception("worker connection closed unexpectedly")\nException: worker connection closed unexpectedly", "service": "livekit-agent", "severity": "WARNING", "timestamp": "2025-11-08T20:30:09.935352+00:00"}
{"message": "registered worker", "level": "INFO", "name": "livekit.agents", "id": "CAW_QVAqNAHneLpW", "url": "https://mirror-prod-frs6nvwg.livekit.cloud", "region": "US East B", "protocol": 16, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-08T20:30:10.235292+00:00"}
{"message": "received job request", "level": "INFO", "name": "livekit.agents", "job_id": "AJ_hYMchA3iAzbr", "dispatch_id": "AD_DBwZPPrSMgk6", "room_name": "agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762730991665_kjhky1n", "agent_name": "", "resuming": false, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:52.987779+00:00"}
{"message": "job_request_fnc failed", "level": "ERROR", "name": "livekit.agents", "exc_info": "Traceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/utils/aio/duplex_unix.py", line 50, in send_bytes\n await self._writer.drain()\n File "/usr/local/lib/python3.11/asyncio/streams.py", line 392, in drain\n await self._protocol._drain_helper()\n File "/usr/local/lib/python3.11/asyncio/streams.py", line 166, in _drain_helper\n raise ConnectionResetError('Connection lost')\nConnectionResetError: Connection lost\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 886, in _job_request_task\n await self._opts.request_fnc(job_req)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 69, in _default_request_fnc\n await ctx.accept()\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/job.py", line 548, in accept\n await self._on_accept(accept_arguments)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/worker.py", line 868, in _on_accept\n await self._proc_pool.launch_job(running_info)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/proc_pool.py", line 113, in launch_job\n await proc.launch_job(info)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/job_proc_executor.py", line 152, in launch_job\n await channel.asend_message(self._pch, start_req)\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/ipc/channel.py", line 51, in asend_message\n await dplx.send_bytes(_write_message(msg))\n File "/usr/local/lib/python3.11/site-packages/livekit/agents/utils/aio/duplex_unix.py", line 52, in send_bytes\n raise DuplexClosed() from e\nlivekit.agents.utils.aio.duplex_unix.DuplexClosed", "job_request": "<livekit.agents.job.JobRequest object at 0x7fea54b6c8d0>", "agent_name": "", "service": "livekit-agent", "severity": "ERROR", "timestamp": "2025-11-09T23:29:53.037891+00:00"}
{"message": "initializing process", "level": "INFO", "name": "livekit.agents", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.056586+00:00"}
{"message": "▶ Worker process 9810 starting prewarm", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.318137+00:00"}
{"message": "🔥 Prewarming inference executor...", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.318290+00:00"}
{"message": "process initialized", "level": "INFO", "name": "livekit.agents", "pid": 9810, "elapsed_time": 0.32, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.373195+00:00"}
{"message": "VAD model loaded and cached", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.372681+00:00"}
{"message": "✅ Worker process 9810 prewarm complete", "level": "INFO", "name": "livekit-agent", "pid": 9810, "service": "livekit-agent", "severity": "INFO", "timestamp": "2025-11-09T23:29:53.372799+00:00"}
Expected Behavior
Prewarmed worker processes should remain alive and ready to handle incoming job requests. When a job arrives, the main process should successfully dispatch it to an idle prewarmed worker without connection errors.
Reproduction Steps
I am unable to reproduce it myself, it has been running for weeks on LiveKit's cloud without any troubles.
Deployment:
- Deploy to LiveKit Cloud using lk agent deploy via Cloud Build
- Agent configured with 4 idle processes (num_idle_processes=4)
- Memory limits: 500MB warn, 1GB limit per process (when I do that it's telling me "custom load_threshold is not supported when hosting on Cloud, reverting to default" do you think this could cause the issue? What am I supposed to put there then?")
Prewarm function:
def prewarm(proc: JobProcess):
setup_eou_logging() # Logging setup
proc.userdata["vad"] = load_vad_model_singleton() # Load Silero VAD model
proc.userdata["google_plugin_available"] = GOOGLE_PLUGIN_AVAILABLE
Entrypoint function:
- Parses room name for agent_id/session_id
- Initializes Redis logging
- Sets up event monitoring
- Registers room event handlers
- Runs agent session
Logging setup:
Axiom logging configured at module level (main process only, inherited by workers):
if mp.current_process().name == 'MainProcess':
axiom_client = axiom_py.Client(axiom_token)
axiom_handler = AxiomHandler(axiom_client, axiom_dataset)
root_logger.addHandler(axiom_handler)
else:
# Child process - handler will be inherited via fork
root_logger.setLevel(logging.INFO)
Redis logging initialized in entrypoint when job starts:
async def entrypoint(ctx: JobContext):
# ... parse room name, initialize monitoring ...
setup_redis_logging(job_id=job_id, room_id=room_id, session_id=session_id)
logger.info(f"✅ Redis log handler initialized for job {job_id}, session {session_id}")
Operating System
LiveKit Cloud managed infrastructure, joining from a MacBook Pro in chrome.
Models Used
Don't think it's relevant here but Nova-3, GPT 4.1, Google Chirp 3 HD
Package Versions
Backend:
Python: 3.11
livekit: 1.0.12
livekit-agents: 1.2.14
livekit-api: 1.0.5
livekit-protocol: 1.0.4
livekit-plugins-silero: 1.2.14 (VAD)
livekit-plugins-openai: 1.2.14
livekit-plugins-anthropic: 1.2.14
livekit-plugins-deepgram: 1.2.14
livekit-plugins-cartesia: 1.2.14
livekit-plugins-elevenlabs: 1.2.14
livekit-plugins-google: 1.2.14
Client SDK:
livekit-client: 2.15.14 (React/TypeScript frontend)
Session/Room/Call IDs
Here are examples of rooms that failed, what is interesting is that there doesn't seem to be any agent in comparison to one that succeeds:
Failed:
-
Room: agent_7f532e0b-66ef-448d-885e-4e5152dbb204_session_session_1762731129067_3d3tf2z
-
Room session ID: RM_Kyc92XrgWN6g
-
Room: agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762731096451_3wjnd7n
-
Room session ID: RM_7RcfQg4jyzVn
Succeeded (after redeployment):
- Room: agent_38c2242a-4539-4607-bfbe-50f4b3eb4f5c_session_session_1762732720714_5bo8yka
- Room session ID: RM_4AFHQUrDxdZY
Proposed Solution
Additional Context
Here is my dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu ||
pip install --no-cache-dir torch==2.3.1
RUN pip install --no-cache-dir --upgrade -r requirements.txt
COPY . .
RUN python main.py download-files
EXPOSE 8081
ENV LIVEKIT_NUM_PROCESSES=4
ENV LIVEKIT_LOAD_THRESHOLD=0.7
ENV LIVEKIT_MEMORY_WARN_MB=1000
ENV LIVEKIT_MEMORY_LIMIT_MB=2000 (could that be an issue? I am getting warnings when starting up the service, what should I put?)
CMD ["python", "main.py", "start"]