
Potential Memory Leak in Response Buffer

Open aniketmaurya opened this issue 9 months ago • 7 comments

πŸ› Bug

The `response_buffer` dictionary could grow indefinitely if requests fail before their responses are processed.

Proposed fix: implement a cleanup mechanism or a timeout for orphaned entries.

Code sample

Expected behavior

Environment

If you published a Studio with your bug report, we can automatically get this information. Otherwise, please describe:

  • PyTorch/Jax/Tensorflow Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

aniketmaurya avatar Mar 12 '25 11:03 aniketmaurya

"Could grow" - is there a way we can test this, so we can run it again after writing the potential fix? Also, what do you want the cleanup mechanism to look like?

haddyadnan avatar Mar 16 '25 05:03 haddyadnan

"Could grow" β€” is there a way we can test this so we can rerun it after writing a potential fix? Also, what do you want the cleanup mechanism to look like?

Hi @haddyadnan, to test this, I think you can simulate a batch of requests (say 100 or more) and randomly cancel around 30–40% of them before the response is processed or returned. You could emulate this by introducing an artificial delay (e.g., `time.sleep`) on the server side and cancelling the request from the client during that delay.

After this test run, you can inspect the `response_buffer`: it should still hold entries for the cancelled requests, indicating a potential memory-leak risk.
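A minimal sketch of such a test, assuming a deliberately slow endpoint and an httpx client that abandons a fraction of its requests (the `SlowAPI`/`flood` names, the 2-second delay, and the timeout values are illustrative, not part of LitServe):

```python
import random
import time

import httpx
import litserve as ls


class SlowAPI(ls.LitAPI):
    def setup(self, device):
        pass

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        time.sleep(2)  # artificial delay so the client can bail out mid-request
        return x

    def encode_response(self, output):
        return {"output": output}


def flood(url="http://127.0.0.1:8000/predict", n=100):
    """Fire n requests; roughly 40% time out before the server responds."""
    for i in range(n):
        timeout = 0.5 if random.random() < 0.4 else 10  # 0.5s < server delay
        try:
            httpx.post(url, json={"input": i}, timeout=timeout)
        except httpx.TimeoutException:
            pass  # client gave up: this is the simulated cancellation


if __name__ == "__main__":
    # run the server in a separate process: ls.LitServer(SlowAPI()).run(port=8000)
    flood()
```

After `flood()` returns, a breakpoint or debug log inside the server can show whether `response_buffer` still holds entries for the timed-out requests.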

As for the cleanup mechanism: since the buffer should ideally only retain responses for active requests, we could track disconnected or cancelled request IDs. Then, using a non-blocking background task (e.g., a periodic coroutine) or another mechanism, we could scan the buffer and remove entries associated with those stale request IDs.
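A rough sketch of that idea (`stale_uids`, `mark_stale`, and `cleanup_loop` are hypothetical names, not existing LitServe internals):

```python
import asyncio

response_buffer: dict = {}   # uid -> buffered response state
stale_uids: set = set()      # uids of requests seen as disconnected/cancelled


def mark_stale(uid) -> None:
    """Call this wherever a client disconnect or cancellation is detected."""
    stale_uids.add(uid)


async def cleanup_loop(interval: float = 60.0) -> None:
    """Periodically evict buffer entries that belong to stale requests."""
    while True:
        await asyncio.sleep(interval)
        for uid in list(stale_uids):
            response_buffer.pop(uid, None)  # no-op if already consumed
            stale_uids.discard(uid)
```

The loop would be started once at server startup, e.g. with `asyncio.create_task(cleanup_loop())`.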

I remember implementing something similar in an older (now closed) PR, but for a different use case.

@aniketmaurya wdyt?

bhimrazy avatar May 18 '25 16:05 bhimrazy

thank you for elaborating here @bhimrazy! sounds like a great idea 🚀

aniketmaurya avatar May 18 '25 18:05 aniketmaurya

Hi @kumarrah2002, would you like to take a look at this? It might be an interesting one.

bhimrazy avatar Jun 12 '25 07:06 bhimrazy

Will do, @bhimrazy @aniketmaurya. Do you think a timed expiration check would be the best approach here? We can add a timestamp field to the response buffer with an expiration of 3-5 minutes and have an asynchronous function that checks the buffer every minute. The timestamp (the new `float`) can go here: `response_buffer: Dict[str, Union[Tuple[deque, asyncio.Event, float], asyncio.Event]]`
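A hedged sketch of what that could look like (the entry layout mirrors the type hint above, but `put_entry`, `expire_loop`, and the constants are illustrative, and only the tuple form of the Union is handled):

```python
import asyncio
import time
from collections import deque

# uid -> (response deque, done event, creation timestamp)
response_buffer: dict = {}

TTL_SECONDS = 300  # 3-5 minute expiration, per the proposal


def put_entry(uid) -> None:
    response_buffer[uid] = (deque(), asyncio.Event(), time.monotonic())


async def expire_loop() -> None:
    while True:
        await asyncio.sleep(60)  # check the buffer every minute
        now = time.monotonic()
        for uid, (_, _, created_at) in list(response_buffer.items()):
            if now - created_at > TTL_SECONDS:
                response_buffer.pop(uid, None)
```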

kumarrah2002 avatar Jun 12 '25 23:06 kumarrah2002

> Will do, @bhimrazy @aniketmaurya. Do you think a timed expiration check would be the best approach here? We can add a timestamp field to the response buffer with an expiration of 3-5 minutes and have an asynchronous function that checks the buffer every minute. The timestamp (the new `float`) can go here: `response_buffer: Dict[str, Union[Tuple[deque, asyncio.Event, float], asyncio.Event]]`

Hi @kumarrah2002, sounds good to me. We just need to make sure we don't remove entries for clients that are still connected. I think we can align the cleanup logic with the timeout value in LitServer; anything beyond that should be safe to treat as stale.

bhimrazy avatar Jun 13 '25 19:06 bhimrazy

This likely only happens in streaming cases, since for non-streaming responses the request/response cycle usually completes even if the client disconnects midway. I'll try to share a reproducible script sometime this week or next.
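For reference, a minimal client-side snippet for the streaming scenario (the URL and payload are placeholders), which drops the connection after the first chunk:

```python
import httpx

# open a streaming response and abandon it mid-stream; leaving the context
# manager closes the socket, simulating a client disconnect
with httpx.stream("POST", "http://127.0.0.1:8000/predict",
                  json={"input": "hello"}, timeout=10) as resp:
    for chunk in resp.iter_bytes():
        break  # give up after the first chunk
```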

bhimrazy avatar Jul 13 '25 17:07 bhimrazy