
Memory leak within Docker containers and on local machines

Open theo1893 opened this issue 7 months ago • 19 comments

issue

I'm not sure if this is a bug, but it seems I'm not the only one encountering this memory leak issue. The link above is the related post containing the problem description and lib versions.

Here is what I found while debugging: I have been debugging this for several days, and the issue seems to be related to the LiveKit ffi lib. When a user tries to connect to a LiveKit room, the requests below are sent to the ffi lib sequentially:

connect -> create_audio_track -> publish_track -> set_subscribed -> new_audio_stream -> ...

and the memory leak seems (I'm not sure) to happen after set_subscribed.

I hope the information above can help locate and resolve this problem; otherwise I have to restart my agent server every few hours 😭

Update: the memory leak also happens on my local machine (M1, macOS 15.3.2) on different Python versions (3.11.7 / 3.12.6 / 3.13.2), and this repo contains the dependencies and detailed reproduction steps

theo1893 avatar Apr 30 '25 12:04 theo1893

Yes, I've experienced memory leaks too. I haven't been able to pinpoint the exact location of the leak, but it looks like the memory used is not released even after the call is dropped.

Denin-Siby avatar May 02 '25 07:05 Denin-Siby

Could you describe the environment you are running on? What Linux base image, and what architecture?

are you able to reproduce with the standard basic_agent demo?

davidzhao avatar May 03 '25 06:05 davidzhao

@davidzhao hi David, I have run many tests (20+ cases) with different Python versions, and I found this issue also happens on my local machine (M1, macOS 15.3.2), not only in the Docker container, but I still don't know under which circumstances it happens. The link below is the memory-leak demo repo, which contains the code (the simplest use of LiveKit Agents) and some memory usage graphs from my tests, as well as the dependencies and reproduction steps. I hope this is helpful.

https://github.com/theo1893/livekit-memory-leak-demo

I have not run the standard basic_agent demo yet because I was collecting the records for the repo above; I will run the standard demo ASAP to see whether the memory leak happens there too.

theo1893 avatar May 03 '25 12:05 theo1893

@theo1893 I was not able to reproduce the issue with your script following the steps (connect to the agent twice); I tested both 1.0.17 and 1.0.18 multiple times. The RAM was released after I muted the audio. This is on macOS 15, M4, Python 3.12.

The only difference is that I didn't create a room with your script; connecting from the agent playground automatically creates the room and has the agent join. But I don't think that would cause a different result. Could you try the latest version, use the playground to create the room, and see if you still have the issue?

1.0.17: [three memory usage screenshots]

1.0.18: [memory usage screenshot]

longcw avatar May 03 '25 14:05 longcw

An important detail to note is that the example above utilizes the THREAD-based JobExecutor. This behavior may not occur when using the default PROCESS-based executor.

https://github.com/theo1893/livekit-memory-leak-demo/blob/4accb4147516139532663bd7bc6839a5c2827d49/start_worker.py#L162
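For reference, the executor type is selected through WorkerOptions. A minimal sketch, assuming the JobExecutorType enum and the entrypoint pattern of recent livekit-agents 1.x releases (check your installed version for the exact names):

```python
# Sketch: selecting the job executor type in livekit-agents.
# The JobExecutorType/WorkerOptions field names are assumed from recent 1.x releases.
from livekit.agents import JobContext, JobExecutorType, WorkerOptions, cli


async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()  # join the room assigned to this job


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            # PROCESS gives each job its own memory space, so all RAM is released
            # when the job exits; THREAD keeps every job inside the worker process.
            job_executor_type=JobExecutorType.PROCESS,
        )
    )
```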

theomonnom avatar May 03 '25 18:05 theomonnom

@longcw hi master, as I said, I cannot reproduce this leak at will right now, but I did manage to reproduce it at about 2025-05-04 09:10 (UTC+8), and here are the result graphs:

[memory usage graphs]

Here are my steps:

  1. Create a new conda env with python=3.12.6 and install requirements.txt
  2. Execute python start_worker.py
  3. Connect to the room directly via the Playground, without executing create_room.py (because, as you mentioned, connecting creates a new room, so I do not need to create one manually)
  4. Don't say anything, and watch the memory output on the console for a few seconds (a sketch of such memory logging follows below).
  5. Disconnect.
  6. Repeat Steps 3 to 5 five times. The memory usage was relatively stable during these 5 tests.
  7. On the 6th connect-and-watch cycle, the memory leak happened, as the graphs above show.
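For anyone who wants to reproduce the "watch the memory output" step without the demo repo, here is a minimal sketch of periodic RSS logging. It assumes psutil is installed; the helper name is illustrative, not taken from the demo:

```python
# Sketch: periodically print the worker's resident memory so growth across
# connect/disconnect cycles is visible on the console. Requires `pip install psutil`.
import asyncio
import os

import psutil


async def log_memory(interval_s: float = 5.0) -> None:
    proc = psutil.Process(os.getpid())
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print(f"RSS: {rss_mb:.1f} MiB")
        await asyncio.sleep(interval_s)

# Inside the agent entrypoint, start it as a background task, e.g.:
#   asyncio.create_task(log_memory())
```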

theo1893 avatar May 04 '25 03:05 theo1893

@theomonnom hi master, this memory leak issue was first observed on our process-based agent server, with dependencies as below:

[dependency list screenshot]

On this server we observed two things:

  1. For some reason the memory is always increasing, which is what we are discussing here. I have upgraded the server to the 1.x API version, but the leak still exists.
  2. The memory usage at server start was much higher than I expected, and I thought this was because we were using the process-based executor, which means each JobProcess has an independent memory space. I then changed this to the thread-based executor, and the memory usage at server start became normal.

In my opinion the JobProcess style only determines the workload type (process or thread), so it may not be related to this issue 🤔

theo1893 avatar May 04 '25 05:05 theo1893

@theo1893 I was able to reproduce the issue after creating and closing jobs multiple times. The issue is that the AudioStream was not closed properly after the job exited. It should already be fixed in 1.0.18; can you try the latest version and see if that fixes the memory leak for you?
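For reference, if you are creating AudioStream objects yourself (the fix mentioned above is inside the framework), a minimal sketch of closing one explicitly when you are done with it; the function and variable names are illustrative:

```python
# Sketch: ensure any manually created AudioStream is closed when you are done with it,
# otherwise its frame queue can keep buffering. Names here are illustrative.
from livekit import rtc


async def consume_audio(track: rtc.Track) -> None:
    stream = rtc.AudioStream(track)
    try:
        # in recent SDK versions the iterator yields events carrying a .frame
        async for event in stream:
            frame = event.frame  # process the audio frame here
            _ = frame
    finally:
        await stream.aclose()  # release the underlying ffi resources
```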

longcw avatar May 04 '25 13:05 longcw

@longcw thank you for your help! I will try the latest version ASAP to see if this issue has been fixed 👏

theo1893 avatar May 05 '25 03:05 theo1893

@longcw hi master, after upgrading livekit-agents to 1.0.18, the memory still increases across connections on my M1 Mac, although for the first several connections the memory usage is relatively stable.

[memory usage screenshot]

theo1893 avatar May 06 '25 03:05 theo1893

You can use the PROCESS-based executor for now. I'll investigate it further.

longcw avatar May 06 '25 04:05 longcw

@theo1893 can you try to reproduce it with any of the examples in the repo in dev mode and share the logs when the issue happens? I cannot reproduce it in the latest branch now.

longcw avatar May 06 '25 05:05 longcw

@longcw sure. This is the log file from 2025-05-06 11:18 (UTC+8), corresponding to the image I posted.

livekit_memory_leak.log

and here is the dependency list related to livekit:

[dependency list screenshot]

theo1893 avatar May 06 '25 06:05 theo1893

sorry, I meant the debug logs of the agent; maybe you can enable them in your script or use the examples in the agents repo.
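If it helps, a minimal sketch of turning on debug output in a custom script, assuming the agent uses standard Python logging under the "livekit" / "livekit.agents" logger names (an assumption; the dev mode of the CLI may configure this for you):

```python
# Sketch: enable debug logging before starting the worker.
# The logger names are an assumption based on the library using standard logging.
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("livekit").setLevel(logging.DEBUG)
logging.getLogger("livekit.agents").setLevel(logging.DEBUG)
```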

longcw avatar May 06 '25 08:05 longcw

On the Y axis is the memory usage in GB. You can see that my bot in the Docker container seems to run out of memory within minutes, eventually leading to a pod restart, with a maximum of 4 users.

[memory usage graphs]

Denin-Siby avatar May 06 '25 14:05 Denin-Siby

@longcw hi master, sorry for the late reply. I have enabled debug logging in the agent, and here is the log file:

livekit_memory_leak_debug.log

and below is the corresponding graph:

[memory usage graph]

Although the memory usage does decrease on my M1 Mac, there is still continuous memory growth during a connection (after many tries), at about 0.1-0.2 MB per second. That seems abnormal.

theo1893 avatar May 07 '25 03:05 theo1893

From the log, all the audio streams created were closed properly when the participant disconnected. So either it's a different issue, or it's memory used for some data while the agent is responding. Can you run it for a longer time, like a few hours, to see what the max RAM usage is?

FYI, if you are blocked by this issue, the process executor shouldn't have it, since the process is closed when the participant disconnects and all RAM is released in that case.

longcw avatar May 07 '25 09:05 longcw

@longcw Thank you master! I will try the process executor after upgrading the libs in our production instances!

theo1893 avatar May 09 '25 07:05 theo1893

@longcw hi master, I also have a memory leak. I create 4000 rooms each time (the client uses Room objects to connect to the room) and then exit, and repeat this several times; you can see that the memory keeps growing with every run (a sketch of such a connect/disconnect loop is below). Is there still memory that is not released after connecting through the Room object?

Environment: livekit==1.0.8, livekit_agents==1.0.22, python==3.11

Memory usage per run of 4000 rooms:

  1. 15.6G, 18.1G
  2. 17.9G, 19.0G
  3. 18.7G, 19.7G, 19.5G
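A minimal sketch of the connect/disconnect loop described above, using rtc.Room directly; the URL and token values are placeholders:

```python
# Sketch: repeatedly connect and disconnect rtc.Room clients to observe whether
# RSS keeps growing. The URL and token below are placeholders.
import asyncio

from livekit import rtc

LIVEKIT_URL = "wss://your-livekit-host"  # placeholder
TOKEN = "<access token>"                 # placeholder


async def one_cycle() -> None:
    room = rtc.Room()
    await room.connect(LIVEKIT_URL, TOKEN)
    await asyncio.sleep(1.0)  # stay connected briefly
    await room.disconnect()


async def main(cycles: int = 100) -> None:
    for i in range(cycles):
        await one_cycle()
        print(f"cycle {i + 1} done")


if __name__ == "__main__":
    asyncio.run(main())
```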

zhushixia avatar May 20 '25 10:05 zhushixia

this happens even when you run it via Processes? @zhushixia

Denin-Siby avatar May 20 '25 19:05 Denin-Siby

> this happens even when you run it via Processes? @zhushixia

With the process executor it has not happened, but that uses too much memory; the leak described above occurs with the thread executor.

zhushixia avatar May 21 '25 01:05 zhushixia

@longcw hi master, maybe this will help you locate the problem: https://github.com/livekit/agents/issues/1186#issuecomment-2836081048. I am on Ubuntu too.

zhushixia avatar May 22 '25 05:05 zhushixia

[memory usage graph]

Experiencing the memory leak as well using processes. It happens pretty slowly, but it happens in all of my agent containers.

keepingitneil avatar May 23 '25 22:05 keepingitneil

Same problem here; I'm seeing a similar graph in my agent too. Running on a Debian 12 host, inside a Docker container.

arpan-reconectai avatar Jun 27 '25 22:06 arpan-reconectai

The Pipecat framework seems to support a force_gc=True option when disconnecting participants, to avoid similar situations (I guess). Could we track what gets cleaned up and what doesn't with GC?
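For what it's worth, a minimal sketch of forcing a collection and counting what survives after a disconnect, using only the standard-library gc module; the type name used in the filter is just an example of something one might count:

```python
# Sketch: force a GC pass after a participant disconnects and report how many
# objects of interest are still alive. Purely standard library; the type filter
# below ("AudioFrame") is only an example.
import gc


def report_gc(label: str) -> None:
    collected = gc.collect()
    suspicious = [o for o in gc.get_objects() if type(o).__name__ == "AudioFrame"]
    print(f"[{label}] collected={collected}, live AudioFrame objects={len(suspicious)}")

# e.g. call report_gc("after disconnect") from a disconnect/shutdown callback
```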

arpan-reconectai avatar Jun 28 '25 06:06 arpan-reconectai

I used tracemalloc to troubleshoot the issue. I took a snapshot before a room started up and another snapshot when closing the connection, to get the code lines that consumed the most memory. I'm not sure if this log can help with locating the problem.

D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\_ffi_client.py:123: size=7076 KiB (+6882 KiB), count=129122 (+125839), average=56 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\asyncio\base_events.py:853: size=6657 KiB (+6394 KiB), count=65549 (+62955), average=104 B
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\_ffi_client.py:151: size=6654 KiB (+6391 KiB), count=131026 (+125850), average=52 B
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\audio_frame.py:62: size=7428 KiB (+5798 KiB), count=25105 (+19377), average=303 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\asyncio\events.py:38: size=4095 KiB (+3933 KiB), count=65520 (+62930), average=64 B
D:\workspace\codeup\livekit-agent\livekit-plugins\livekit-plugins-silero\livekit\plugins\silero\vad.py:304: size=5663 KiB (+2827 KiB), count=4 (+2), average=1416 KiB
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\audio_frame.py:94: size=1587 KiB (+1252 KiB), count=23335 (+18453), average=70 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\asyncio\sslproto.py:278: size=1280 KiB (+1024 KiB), count=10 (+8), average=128 KiB
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\audio_frame.py:67: size=751 KiB (+586 KiB), count=12023 (+9380), average=64 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\asyncio\base_events.py:856: size=529 KiB (+508 KiB), count=1025 (+985), average=528 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\asyncio\proactor_events.py:191: size=448 KiB (+192 KiB), count=14 (+6), average=32.0 KiB
:784: size=84.3 KiB (+84.3 KiB), count=233 (+233), average=370 B
D:\workspace\codeup\livekit-agent\livekit-agents\livekit\agents\debug\tracing.py:49: size=84.3 KiB (+83.8 KiB), count=1772 (+1762), average=49 B
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.13_3.13.1520.0_x64__qbz5n2kfra8p0\Lib\inspect.py:266: size=82.7 KiB (+75.7 KiB), count=706 (+646), average=120 B
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\livekit\rtc\_ffi_client.py:232: size=78.5 KiB (+74.4 KiB), count=590 (+563), average=136 B
D:\workspace\codeup\livekit-agent\livekit-plugins\livekit-plugins-jzx\livekit\plugins\jzx\realtime_llm.py:300: size=93.6 KiB (+73.6 KiB), count=1 (+0), average=93.6 KiB
D:\workspace\codeup\livekit-agent\.venv\Lib\site-packages\pydantic\main.py:463: size=73.8 KiB (+73.2 KiB), count=916 (+908), average=82 B
D:\workspace\codeup\livekit-agent\livekit-agents\livekit\agents\debug\tracing.py:34: size=80.6 KiB (+72.2 KiB), count=1173 (+1166), average=70 B
:123: size=158 KiB (+60.3 KiB), count=1511 (+733), average=107 B
D:\workspace\codeup\livekit-agent\livekit-agents\livekit\agents\voice\agent_session.py:522: size=47.8 KiB (+47.5 KiB), count=672 (+668), average=73 B
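For anyone who wants to reproduce this, a minimal sketch of the snapshot-and-compare workflow described above, using only the standard-library tracemalloc module; where exactly the snapshots are taken depends on your agent code:

```python
# Sketch: take a tracemalloc snapshot when the room starts and another when the
# connection closes, then print the lines whose allocations grew the most.
import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation for better attribution

snapshot_before = tracemalloc.take_snapshot()  # e.g. right before the room starts

# ... run the session ...

snapshot_after = tracemalloc.take_snapshot()   # e.g. when closing the connection
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:20]:
    print(stat)
```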

MicroCountry avatar Jul 17 '25 08:07 MicroCountry

Thank you guys for your patience. Please try the following alternatives, and if the issue persists please send me an email and I will take further steps from there.

Do you mean LiveKit (the real-time audio/video SFU)? I'll assume that; if you meant something else, say so and I'll adjust. Below is a focused list of common LiveKit memory-leak causes and a compact, practical step-by-step troubleshooting + fix plan for client (web, Android, iOS) and server (Go) setups.

High-level causes specific to LiveKit

- Not closing/disposing PeerConnections / Room / Engine objects (clients).
- Not stopping MediaStreamTracks or local capture (getUserMedia streams left running).
- Not detaching/removing renderers (video elements, SurfaceViews, VideoRenderers).
- Event listeners / callbacks left registered on Room, Participants, Tracks.
- Unpublished/unsubscribed tracks still referenced by app code or caches.
- Timers/intervals (stats polling, keepalives) not cleared.
- For Android/iOS: SDK objects (renderers, sinks, Track objects) not released (native resources remain).
- For the Go server: goroutine leaks (e.g., per-connection loops not exiting), maps/slices that grow unbounded (participant/room caches), not closing network connections, not flushing/closing track buffers.
- WebRTC-specific: RTCPeerConnection, Transceivers, and MediaStreamTracks retained.
- Third-party components holding references (UI frameworks, logging, analytics).

Step-by-step troubleshooting workflow (applies to any platform)

  1. Reproduce deterministically: create a small scenario that exercises connect/disconnect, publish/unpublish, join/leave repeatedly.
  2. Measure baseline and growth: record memory before, during, and after several cycles. Look for progressive growth that doesn't drop after GC.
  3. Pick the right profiler/tooling (see below) and capture snapshots over time: take heap snapshots or native memory recordings at intervals (e.g., after each join/leave).
  4. Compare snapshots to identify types that grow (PeerConnection, MediaStreamTrack, DOM nodes, custom objects): inspect retained size and reference paths to GC roots.
  5. Inspect retention paths to find why objects are reachable: look for event listeners, static caches, global vars, timers, UI components holding refs.
  6. Fix: remove/unsubscribe/stop/close and free resources at lifecycle boundaries; add explicit cleanup on leave/dispose/unmount.
  7. Re-test and confirm memory stabilizes.
  8. Add lifecycle unit tests and monitoring to catch regressions.

Platform-specific detection and fixes

Web / LiveKit JS
- Tools: Chrome DevTools (Memory panel: Heap snapshot, Allocation instrumentation, Timeline); WebRTC internals (about:webrtc in some browsers).
- Common culprits: RTCPeerConnection, MediaStreamTrack, HTMLVideoElement nodes, Room event listeners.
- Cleanup checklist on leave/dispose:
  - room.disconnect() (or room.off + close underlying connections).
  - Stop local capture: for each localTrack call track.stop() (or MediaStreamTrack.stop()).
  - Detach and remove video elements: for each track call track.detach() or video.srcObject = null and remove the element.
  - Unsubscribe/unpublish if needed: localParticipant.unpublishTrack(track).
  - Remove event listeners: room.off(...), participant.off(...).
  - Clear any polling timers (stats intervals).
- Debug steps (example): reproduce join/leave N times; take heap snapshots after each cycle and compare; search for retained WebRTC objects, PeerConnection, MediaStreamTrack, HTMLVideoElement.

Android (LiveKit Android SDK)
- Tools: Android Studio Profiler (Memory), dump HPROF → analyze with MAT, LeakCanary for runtime leak detection.
- Common culprits: SurfaceViewRenderer/TextureView leaks, VideoSink still attached, tracks not released, PeerConnection not closed.
- Cleanup checklist:
  - room.disconnect() / room.close() (use the SDK's disconnect method).
  - Stop and release local tracks: localVideoTrack.stopCapture()/stop() and localVideoTrack.release() (check exact SDK methods).
  - Remove video sinks: videoTrack.removeSink(renderer); call renderer.release() / surfaceView.release() as appropriate.
  - Unregister listeners/callbacks from Room/Participants/Tracks.
  - Stop polling timers and background handlers.
- Debug steps: use LeakCanary to detect retained Activities or Views; capture an HPROF and inspect retained objects such as PeerConnection, MediaStreamTrack, Activity.

iOS (LiveKit iOS SDK)
- Tools: Xcode Instruments (Allocations, Leaks), memory graph debugger.
- Culprits: video renderers not released, tracks still active, strong reference cycles (closures).
- Cleanup checklist:
  - room.disconnect() / room.dispose() per SDK.
  - Stop local capture and release tracks: localVideoTrack.stop() / localVideoTrack.release().
  - Remove renderers/sinks and remove them from the view hierarchy.
  - Remove listeners/observers and invalidate timers.
  - Check closures / delegates for strong reference cycles; use weak references.
- Debug steps: run Instruments while join/leave cycles execute; look for allocations that don't drop and for leaked objects.

LiveKit Server (Go)
- Tools: go pprof (heap, goroutine), /debug/pprof endpoints, go tool pprof -http=:6060, pprof in production, vet for goroutine leaks.
- Common culprits: goroutines blocked on channels, not closing transports or subscriptions, maps/slices retaining participant state, file descriptors not closed.
- Cleanup checklist:
  - Ensure connection teardown closes all goroutines (use contexts, cancel on disconnect).
  - Close network connections and transports, drain channels where necessary.
  - Remove participants from room maps and free structures.
  - Avoid unbounded caches; add eviction or TTL.
  - Explicitly stop any per-room background workers when the room closes.
- Debug steps: collect a goroutine profile before/after connecting/disconnecting and look for goroutines that pile up; capture a heap profile and inspect large allocations by stack traces.

Concrete examples (conceptual; check your SDK version for exact call names)

Web (JS) cleanup pattern, on leaving a room:
- room.offAllListeners();
- for each localTrack: localTrack.stop(); localTrack.detach(); localParticipant.unpublishTrack(localTrack);
- remove video elements from the DOM and null the references;
- room.disconnect(); set room = null;
- clearInterval(statsInterval);

Android (pseudocode), in onDestroy / onLeave:
- room.offAllListeners();
- for each localTrack: localTrack.stopCapture(); localTrack.release();
- for each remoteVideoRenderer: videoTrack.removeSink(renderer); renderer.release();
- room.disconnect(); room = null;
- cancel background handlers/timers.

Go server (pprof usage):
- Enable the pprof handler: import _ "net/http/pprof" and go func() { log.Fatal(http.ListenAndServe(":6060", nil)) }()
- Take a heap profile: go tool pprof http://localhost:6060/debug/pprof/heap (in the pprof UI: top, web, list)
- Take a goroutine profile: go tool pprof http://localhost:6060/debug/pprof/goroutine

Checklist for preventing leaks in LiveKit apps
- Always tie resource lifetime to the lifecycle (Activity/VC/component).
- Explicitly stop and release media tracks and renderers.
- Unregister all SDK event listeners on teardown.
- Use weak references, or avoid long-lived global caches of track/participant objects.
- Limit stats polling frequency and clear timers.
- Add automated smoke tests that repeatedly join/leave and assert memory doesn't trend upward.
- Use LeakCanary / Instruments / DevTools in development CI.

If you want, next I can give exact cleanup code for the platform you care about (Web, Android, iOS, or Go server). Paste a short snippet of your current join/leave/cleanup code and I'll point out precisely what to add/change to eliminate the leaks.
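Translated to the Python agent side discussed in this thread, a minimal cleanup sketch tied to the job lifecycle. It assumes JobContext.add_shutdown_callback and the EventEmitter-style room.on/room.off found in recent livekit-agents/livekit SDK versions; the handler and list names are illustrative:

```python
# Sketch: tie cleanup to the job lifecycle so nothing keeps the job's memory alive.
# add_shutdown_callback / room.on / room.off are assumed from recent SDK versions;
# the stream list and listener names are illustrative.
from livekit import rtc
from livekit.agents import JobContext


async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()
    open_streams: list[rtc.AudioStream] = []

    def on_track_subscribed(track, publication, participant) -> None:
        # keep a reference so the stream can be closed explicitly later
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            open_streams.append(rtc.AudioStream(track))

    ctx.room.on("track_subscribed", on_track_subscribed)

    async def cleanup() -> None:
        # close streams and remove listeners when the job shuts down
        for stream in open_streams:
            await stream.aclose()
        ctx.room.off("track_subscribed", on_track_subscribed)

    ctx.add_shutdown_callback(cleanup)
```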

OscaeGTX avatar Aug 17 '25 13:08 OscaeGTX