
Potential Memory Leak in Response Buffer

Open aniketmaurya opened this issue 9 months ago • 7 comments

πŸ› Bug

The `response_buffer` dictionary could grow indefinitely if requests fail before their responses are processed.

Proposed fix: implement a cleanup mechanism or a timeout for orphaned entries.

Code sample

Expected behavior

Environment

If you published a Studio with your bug report, we can automatically get this information. Otherwise, please describe:

  • PyTorch/Jax/Tensorflow Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

aniketmaurya avatar Mar 12 '25 11:03 aniketmaurya

"Could grow" - is there a way we can test this, so we can run it again after writing the potential fix? Also, what do you want the cleanup mechanism to look like?

haddyadnan avatar Mar 16 '25 05:03 haddyadnan

"Could grow" β€” is there a way we can test this so we can rerun it after writing a potential fix? Also, what do you want the cleanup mechanism to look like?

Hi @haddyadnan, to test this, I think you can simulate a batch of requests (say 100 or more) and randomly cancel around 30–40% of them before the response is processed or returned. You could emulate this by introducing an artificial delay (e.g., `time.sleep`) on the server side and cancelling the request from the client during that delay.

After this test run, you can inspect the `response_buffer`: it should still hold entries for the cancelled requests, indicating a potential memory-leak risk.
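A minimal sketch of such a test, assuming a deliberately slow endpoint and an httpx client that abandons a fraction of its requests (the `SlowAPI`/`flood` names, the 2-second delay, and the timeout values are illustrative, not part of LitServe):

```python
import random
import time

import httpx
import litserve as ls


class SlowAPI(ls.LitAPI):
    def setup(self, device):
        pass

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        time.sleep(2)  # artificial delay so the client can bail out mid-request
        return x

    def encode_response(self, output):
        return {"output": output}


def flood(url="http://127.0.0.1:8000/predict", n=100):
    """Fire n requests; roughly 40% time out before the server responds."""
    for i in range(n):
        timeout = 0.5 if random.random() < 0.4 else 10  # 0.5s < server delay
        try:
            httpx.post(url, json={"input": i}, timeout=timeout)
        except httpx.TimeoutException:
            pass  # client gave up: this is the simulated cancellation


if __name__ == "__main__":
    # run the server in a separate process: ls.LitServer(SlowAPI()).run(port=8000)
    flood()
```

After `flood()` returns, a breakpoint or debug log inside the server can show whether `response_buffer` still holds entries for the timed-out requests.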

As for the cleanup mechanism: since the buffer should ideally only retain responses for active requests, we could track disconnected or cancelled request IDs. Then, using a non-blocking background task (e.g., a periodic coroutine) or another mechanism, we could scan the buffer and remove entries associated with those stale request IDs.
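A rough sketch of that idea (`stale_uids`, `mark_stale`, and `cleanup_loop` are hypothetical names, not existing LitServe internals):

```python
import asyncio

response_buffer: dict = {}   # uid -> buffered response state
stale_uids: set = set()      # uids of requests seen as disconnected/cancelled


def mark_stale(uid) -> None:
    """Call this wherever a client disconnect or cancellation is detected."""
    stale_uids.add(uid)


async def cleanup_loop(interval: float = 60.0) -> None:
    """Periodically evict buffer entries that belong to stale requests."""
    while True:
        await asyncio.sleep(interval)
        for uid in list(stale_uids):
            response_buffer.pop(uid, None)  # no-op if already consumed
            stale_uids.discard(uid)
```

The loop would be started once at server startup, e.g. with `asyncio.create_task(cleanup_loop())`.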

I remember implementing something similar in an older (now closed) PR, but for a different use case.

@aniketmaurya wdyt?

bhimrazy avatar May 18 '25 16:05 bhimrazy

thank you for elaborating here @bhimrazy! sounds like a great idea 🚀

aniketmaurya avatar May 18 '25 18:05 aniketmaurya

Hi @kumarrah2002, would you like to take a look at this? It might be an interesting one.

bhimrazy avatar Jun 12 '25 07:06 bhimrazy

Will do, @bhimrazy @aniketmaurya. Do you think a timed expiration check would be the best approach here? We can add a timestamp field to the response buffer with an expiration of 3-5 minutes and have an asynchronous function that checks the buffer every minute. The timestamp (the new `float`) can go here: `response_buffer: Dict[str, Union[Tuple[deque, asyncio.Event, float], asyncio.Event]]`
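A hedged sketch of what that could look like (the entry layout mirrors the type hint above, but `put_entry`, `expire_loop`, and the constants are illustrative, and only the tuple form of the Union is handled):

```python
import asyncio
import time
from collections import deque

# uid -> (response deque, done event, creation timestamp)
response_buffer: dict = {}

TTL_SECONDS = 300  # 3-5 minute expiration, per the proposal


def put_entry(uid) -> None:
    response_buffer[uid] = (deque(), asyncio.Event(), time.monotonic())


async def expire_loop() -> None:
    while True:
        await asyncio.sleep(60)  # check the buffer every minute
        now = time.monotonic()
        for uid, (_, _, created_at) in list(response_buffer.items()):
            if now - created_at > TTL_SECONDS:
                response_buffer.pop(uid, None)
```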

kumarrah2002 avatar Jun 12 '25 23:06 kumarrah2002

> Will do, @bhimrazy @aniketmaurya. Do you think a timed expiration check would be the best approach here? We can add a timestamp field to the response buffer with an expiration of 3-5 minutes and have an asynchronous function that checks the buffer every minute. The timestamp (the new `float`) can go here: `response_buffer: Dict[str, Union[Tuple[deque, asyncio.Event, float], asyncio.Event]]`

Hi @kumarrah2002, sounds good to me. We just need to make sure we don't remove entries for clients that are still connected. I think we can align the cleanup logic with the timeout value in LitServer; anything beyond that should be safe to treat as stale.

bhimrazy avatar Jun 13 '25 19:06 bhimrazy

This likely only happens in streaming cases, since for non-streaming responses the request/response cycle usually completes even if the client disconnects midway. I'll try to share a reproducible script sometime this week or next.
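For reference, a minimal client-side snippet for the streaming scenario (the URL and payload are placeholders), which drops the connection after the first chunk:

```python
import httpx

# open a streaming response and abandon it mid-stream; leaving the context
# manager closes the socket, simulating a client disconnect
with httpx.stream("POST", "http://127.0.0.1:8000/predict",
                  json={"input": "hello"}, timeout=10) as resp:
    for chunk in resp.iter_bytes():
        break  # give up after the first chunk
```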

bhimrazy avatar Jul 13 '25 17:07 bhimrazy