
Memory leak problem

Open zhushixia opened this issue 7 months ago • 5 comments

Hi, I also have a memory leak issue. When I use the Python SDK to create 4000 rooms at a time (the client uses Room objects to connect to each room), then exit and repeat this several times, I can see the memory growing on every iteration. Is there memory that is not released after connecting through the Room object?

Environment: livekit==1.0.8, livekit_agents==1.0.22, python==3.11

4000 rooms per run; memory after each run: 15.6G 18.1G 17.9G 19.0G 18.7G 19.7G 19.5G

zhushixia avatar May 23 '25 01:05 zhushixia

Hey, is this using rtc.Room directly, or are you using livekit-agents?

theomonnom avatar May 25 '25 22:05 theomonnom

Hey @theomonnom , I faced the same issue with rtc.Room.

I figured out some problems:

  • Forgetting to call task_done in some consumer loops. I opened a PR that I hope resolves part of the issue.
  • Under high load, my LiveKit server often doesn't return events correctly. That leaves the client blocked at the init step, which blocks my FastAPI app as well. I have to run AudioStream init and Room.disconnect in a separate thread to avoid blocking the event loop.
  • I start 200 rooms at 1 room/second. After all rooms are stopped, memory and CPU usage remain high and I don't know why; please take a look at the attached screenshot.
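For the first point, here is a minimal sketch of the task_done pattern (with a hypothetical consumer, not the SDK's actual code): every `asyncio.Queue.get()` must be paired with a `task_done()`, otherwise `queue.join()` blocks forever and pins the consumer task, and everything it references, in memory.

```python
import asyncio

async def consume(queue: asyncio.Queue) -> None:
    # Drain items until a None sentinel. Every get() must be matched by a
    # task_done(); if it is forgotten, queue.join() below waits forever and
    # keeps the consumer task (and everything it references) alive.
    while True:
        item = await queue.get()
        try:
            if item is None:
                return
            # ... process the item (e.g. an audio frame or room event) ...
        finally:
            queue.task_done()  # omitting this call is the leak described above

async def main() -> int:
    queue: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(consume(queue))
    for i in range(3):
        queue.put_nowait(i)
    await queue.join()       # returns only because task_done() was called
    queue.put_nowait(None)   # stop the consumer
    await consumer
    return queue.qsize()

print(asyncio.run(main()))  # prints 0: the queue drained and join() returned
```

Putting `task_done()` in a `finally` block guarantees it runs even when processing an item raises.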

thinh-dang-ts avatar Jun 16 '25 04:06 thinh-dang-ts

Could you share a minimal reproducible example?

theomonnom avatar Jun 16 '25 17:06 theomonnom

@theomonnom , I have a quick setup in this repo

This is my LiveKit server configuration:

  • Livekit server: livekit/livekit-server:v1.8.3
  • Livekit SIP: regitllams/livekit-sip:v1.0.0
  • Memory: 4GB
  • CPU: 4 vCPU

[server metrics screenshot]

As you can see, the LiveKit server does not look overloaded during my test.

My test setup sends 10 requests/second. This is not a high load, but some requests still fail.

This is my raw log:

log.log

My thoughts about the memory leak issue:

  1. After a "track subscribe failed" log appears, the audio stream can't connect to the LiveKit server, and the server requests the client to leave

Service log

livekit::rtc_engine::rtc_session:540:livekit::rtc_engine::rtc_session - signal_event taking too much time: Offer(SessionDescription { r#type: "offer", sdp: "v=0\r\no=- 8186539432328218886 1750474505 IN IP4 0.0.0.0\r\ns=-\r\nt=0 0\r\na=msid-semantic:WMS*\r\na=fingerprint:sha-256 EB:A3:1C:0F:AE:07:60:C7:49:73:03:6C:E3:94:BC:76:B4:F4:21:7F:BE:D2:5A:39:C7:2C:32:C2:EF:90:00:67\r\na=extmap-allow-mixed\r\na=group:BUNDLE 0 1\r\nm=application 9 UDP/DTLS/SCTP webrtc-datachannel\r\nc=IN IP4 0.0.0.0\r\na=setup:actpass\r\na=mid:0\r\na=sendrecv\r\na=sctp-port:5000\r\na=ice-ufrag:XysvbbawUOYBLbLZ\r\na=ice-pwd:ohkCXcyzEvTrdIycuVNgNelBvVdSvJUK\r\na=candidate:137188993 1 udp 2130706431 10.5.160.52 52118 typ host\r\na=candidate:137188993 2 udp 2130706431 10.5.160.52 52118 typ host\r\na=candidate:337244590 1 udp 2130706431 169.254.123.1 51502 typ host\r\na=candidate:337244590 2 udp 2130706431 169.254.123.1 51502 typ host\r\na=candidate:2859870515 1 udp 2130706431 100.231.2.65 52896 typ host\r\na=candidate:2859870515 2 udp 2130706431 100.231.2.65 52896 typ host\r\na=candidate:2859870515 1 udp 2130706431 100.231.2.65 56085 typ host\r\na=candidate:2859870515 2 udp 2130706431 100.231.2.65 56085 typ host\r\na=candidate:4058947254 1 udp 2130706431 169.254.4.6 51987 typ host\r\na=candidate:4058947254 2 udp 2130706431 169.254.4.6 51987 typ host\r\na=candidate:3051235169 1 tcp 1671430143 10.5.160.52 7881 typ host tcptype passive\r\na=candidate:3051235169 2 tcp 1671430143 10.5.160.52 7881 typ host tcptype passive\r\na=candidate:2850717774 1 tcp 1671430143 169.254.123.1 7881 typ host tcptype passive\r\na=candidate:2850717774 2 tcp 1671430143 169.254.123.1 7881 typ host tcptype passive\r\na=candidate:394614995 1 tcp 1671430143 100.231.2.65 7881 typ host tcptype passive\r\na=candidate:394614995 2 tcp 1671430143 100.231.2.65 7881 typ host tcptype passive\r\na=candidate:1277030230 1 tcp 1671430143 169.254.4.6 7881 typ host tcptype passive\r\na=candidate:1277030230 2 tcp 1671430143 169.254.4.6 7881 
typ host tcptype passive\r\nm=audio 9 UDP/TLS/RTP/SAVPF 111 63\r\nc=IN IP4 0.0.0.0\r\na=setup:actpass\r\na=mid:1\r\na=ice-ufrag:XysvbbawUOYBLbLZ\r\na=ice-pwd:ohkCXcyzEvTrdIycuVNgNelBvVdSvJUK\r\na=rtcp-mux\r\na=rtcp-rsize\r\na=rtpmap:111 opus/48000/2\r\na=fmtp:111 minptime=10;useinbandfec=1\r\na=rtcp-fb:111 nack \r\na=rtpmap:63 red/48000/2\r\na=fmtp:63 111/111\r\na=ssrc:2963897458 cname:PA_o27c8ggnCiC6|TR_AMUNTKARrKcksD\r\na=ssrc:2963897458 msid:PA_o27c8ggnCiC6|TR_AMUNTKARrKcksD TR_AMUNTKARrKcksD\r\na=ssrc:2963897458 mslabel:PA_o27c8ggnCiC6|TR_AMUNTKARrKcksD\r\na=ssrc:2963897458 label:TR_AMUNTKARrKcksD\r\na=msid:PA_o27c8ggnCiC6|TR_AMUNTKARrKcksD TR_AMUNTKARrKcksD\r\na=sendrecv\r\n", id: 0 })
livekit::rtc_engine:453:livekit::rtc_engine - received session close: "server request to leave" StateMismatch Resume
livekit::rtc_engine:453:livekit::rtc_engine - received session close: "signal client closed: \"stream closed\"" UnknownReason Resume

Livekit log:

[LiveKit server log screenshot]
  2. The connection gets stuck, the resources can't be released, and my app keeps holding the objects in memory.

This is my memory breakdown:

1. /Users/thinh.dang2/Documents/codes/collin_telephony_service/simple_livekit_audio_test.py:415: size=16.0 MiB, count=8, average=2043 KiB
2. /opt/anaconda3/envs/py311/lib/python3.11/linecache.py:137: size=451 KiB, count=4467, average=103 B
3. /Users/thinh.dang2/Library/Caches/pypoetry/virtualenvs/telephony-client-veot9Bdo-py3.11/lib/python3.11/site-packages/livekit/rtc/_ffi_client.py:151: size=416 KiB, count=8209, average=52 B
4. /opt/anaconda3/envs/py311/lib/python3.11/asyncio/queues.py:54: size=72.7 KiB, count=141, average=528 B
5. <frozen importlib._bootstrap_external>:729: size=67.9 KiB, count=1093, average=64 B
6. /opt/anaconda3/envs/py311/lib/python3.11/asyncio/locks.py:168: size=21.5 KiB, count=58, average=380 B
7. /opt/anaconda3/envs/py311/lib/python3.11/stringprep.py:24: size=18.1 KiB, count=2, average=9256 B
8. /opt/anaconda3/envs/py311/lib/python3.11/asyncio/queues.py:48: size=14.8 KiB, count=40, average=380 B
9. /opt/anaconda3/envs/py311/lib/python3.11/asyncio/queues.py:39: size=14.8 KiB, count=40, average=380 B
10. /opt/anaconda3/envs/py311/lib/python3.11/asyncio/queues.py:37: size=14.8 KiB, count=40, average=380 B

If the call can't stop, these objects are kept in memory forever, which makes my memory usage grow significantly.

  3. I tried to stop the app. However, it hangs because the thread connecting to the audio stream is still alive.

Could you give me any advice for this case? I think the main problem is that when something goes wrong, my app gets stuck and can't release its resources: the async tasks and allocated audio objects are never freed.
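One possible mitigation, sketched here with stand-in coroutines rather than the SDK's actual API: bound every connect and cleanup step with `asyncio.wait_for`, so a wedged signal exchange can't hold the task, and the audio objects it owns, forever.

```python
import asyncio

async def connect_with_timeout(connect_coro, make_cleanup, timeout: float) -> bool:
    """Bound a connect step so a wedged signal exchange can't hang forever.

    connect_coro -- the pending connect, e.g. room.connect(url, token)
    make_cleanup -- zero-arg callable returning a cleanup coroutine,
                    e.g. lambda: room.disconnect()
    """
    try:
        await asyncio.wait_for(connect_coro, timeout=timeout)
        return True
    except Exception:
        # On timeout or failure, still attempt cleanup so the handle is
        # released instead of being held in memory forever -- but bound the
        # cleanup too, so shutdown can't hang either.
        try:
            await asyncio.wait_for(make_cleanup(), timeout=timeout)
        except Exception:
            pass  # log and move on
        return False

# Stand-in coroutines for demonstration only:
async def slow_connect():
    await asyncio.sleep(10)   # simulates a connect that never completes

async def quick_cleanup():
    await asyncio.sleep(0)

print(asyncio.run(connect_with_timeout(slow_connect(), quick_cleanup, 0.1)))  # False
```

`wait_for` cancels the inner task on timeout, so the coroutine and its references become collectable instead of lingering in a blocked state.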

thinhdanggroup avatar Jun 21 '25 03:06 thinhdanggroup

Hi team, we’ve confirmed a consistent native memory growth pattern in the LiveKit Python SDK under sustained load. The issue appears to originate from unreleased allocations in the FFI layer (liblivekit_ffi.so).

Context

The SDK is running inside a WebSocket API that acts as an audio bridge between two streaming endpoints. The service creates room connections repeatedly, and even after all rooms are closed and garbage collection is forced, RSS memory increases after each batch.

Environment

  • OS: Linux (Debian-based, aarch64)
  • Python: 3.12.3
  • LiveKit SDK: livekit == 1.0.18
  • Allocator: tested with both default glibc and jemalloc (dirty_decay_ms:0, muzzy_decay_ms:0)
  • Server runtime: uvicorn + asyncio (no threading, no multiprocessing)
  • FFI library: liblivekit_ffi.so bundled with the current SDK release

The issue reproduces consistently across environments, including containers running in GKE and local bare-metal Linux/Mac.

Evidence

  1. Stable Python heap
Baseline heap: 64.5 MB
After 10 calls: 68.8 MB
After 100 calls: 68.9 MB
→ Heap stable (+6.8%), tracemalloc shows no persistent object growth
  2. RSS steadily increasing
Baseline RSS: 249 MB
After 10 calls: 263 MB
After 100 calls: 295 MB
Δ ≈ +46 MB (~18%)
  3. smaps diffs
+13 MB  /usr/local/bin/python3.12
+2.4 MB /app/.venv/lib/python3.12/site-packages/livekit/rtc/resources/liblivekit_ffi.so

No other mappings show proportional growth. Memory accumulates primarily in the process heap and the FFI shared object.
  4. tracemalloc diffs
/app/.venv/lib/python3.12/site-packages/livekit/rtc/_ffi_client.py:
    size=571 KiB (−162 KiB), count=10 768 (−3182)
→ Python allocations stable or shrinking
  5. Behavior with mitigations
    • Running malloc_trim(0) after each batch releases negligible memory.
    • Using jemalloc reduces the RSS slope but not the cumulative growth.
    • RSS continues to rise even when all rooms are disconnected and all FfiHandles are explicitly disposed.
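For reference, a minimal stdlib-only sketch of the kind of heap-vs-RSS comparison described above (note the assumption: `ru_maxrss` is the process's peak resident set size, reported in KiB on Linux and bytes on macOS, so it only ever grows). A native leak shows up as RSS climbing while the tracemalloc heap stays flat.

```python
import gc
import resource
import tracemalloc

def peak_rss() -> int:
    # Peak resident set size of this process: KiB on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

tracemalloc.start()

def measure(label: str) -> tuple[int, int]:
    gc.collect()  # rule out unreachable Python objects first
    heap = sum(stat.size for stat in
               tracemalloc.take_snapshot().statistics("filename"))
    rss = peak_rss()
    print(f"{label}: python heap={heap / 1024:.0f} KiB, peak rss={rss}")
    return heap, rss

before_heap, before_rss = measure("before batch")
# ... run one batch of room connect/disconnect here ...
after_heap, after_rss = measure("after batch")
# If after_rss keeps growing across batches while the heap numbers stay
# flat, the growth is in native allocations, not Python objects.
```

tracemalloc only sees allocations made through Python's allocator, which is exactly why it stays flat here while native FFI allocations push RSS up.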

Analysis

The pattern indicates a native memory leak in the FFI layer, most likely caused by retained allocations inside livekit_ffi_request() or its handle management logic. The leak persists after explicit FfiHandle.dispose() calls and even after invoking ffi_lib.livekit_ffi_dispose() at shutdown.

Request

Could the team confirm whether this FFI-level memory retention is already under investigation or if any patch is planned? We can share:

  • Detailed memray traces (Python + native allocators)
  • /proc/self/smaps snapshots before and after load tests
  • A minimal reproducible script

Thanks for your time and for maintaining this SDK. Happy to provide additional profiling data or to run diagnostic builds if that helps pinpoint the leak.

tarekasishm avatar Nov 07 '25 06:11 tarekasishm