CTranslate2
Memory increase
I used the CTranslate2-quantized version of fastchat-t5 (https://huggingface.co/limcheekin/fastchat-t5-3b-ct2) as the LLM of a question answering system. The QA system is wrapped in a REST API. The model works really well, but an issue I notice is that the GPU memory footprint increases over time (with requests), eventually causing an OOM error.
In my case, I set max_batch_size=1, max_input_size=2048, max_decoding_size=1024. The GPU is an L4 with 24 GB of RAM, which should be more than enough for a model that only takes 3 GB once loaded.
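For reference, here is a minimal sketch of how settings like these map onto CTranslate2's translate_batch() call for an encoder-decoder model such as fastchat-t5. The parameter names max_input_length/max_decoding_length are CTranslate2's; the tokenizer plumbing and the generate_answer wrapper are placeholders, not from this thread:

```python
def generate_answer(translator, tokenize, detokenize, question):
    """Run one QA request through a CTranslate2-style translator.

    `translator` is expected to expose translate_batch() the way
    ctranslate2.Translator does; `tokenize`/`detokenize` stand in for
    the SentencePiece tokenizer that ships with the model.
    """
    tokens = tokenize(question)
    results = translator.translate_batch(
        [tokens],
        max_batch_size=1,          # one request at a time, as in the thread
        max_input_length=2048,     # CTranslate2's name for the input cap
        max_decoding_length=1024,  # CTranslate2's name for the output cap
    )
    # translate_batch returns one TranslationResult per input; take the
    # best hypothesis of the first (and only) result.
    return detokenize(results[0].hypotheses[0])
```

In real usage `translator` would be `ctranslate2.Translator(model_dir, device="cuda")`; the injection here just keeps the request path separate from model loading.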
I was thinking about the following solutions. Can you please provide some suggestions?
- Separate the LLM from the QA system, spin up a REST API just for the LLM, and call that API endpoint from the QA system. I notice that many hosting services like vLLM claim better throughput handling. Would a CTranslate2 model benefit from doing so?
- I notice that the OpenNMT-py REST API server unloads the model to CPU and reloads it based on a timer. When I tried it out, the unload and reload took a couple of seconds. Doing this every few requests does not seem efficient.
Thank you.
Load and unload per request. If you are using FastAPI, you can use BackgroundTasks to do this efficiently:
def reload():
    model.unload_model()
    model.load_model()

@router.post('/query')
def query(..., background_tasks: BackgroundTasks):
    result = model.generate(...)
    background_tasks.add_task(reload)
    return result
Load and unload per request.
Wouldn't this add tremendous latency...?
Wouldn't this add tremendous latency...?
BackgroundTasks is non-blocking, so if each request is spaced out sufficiently, there should be no noticeable latency. You might want to check that the model is loaded before generating, though. The issue is probably related to one of the draft PRs and is in the works of being implemented; while you wait for that, this can be a band-aid.
Oh, and of course you can do this every few requests instead.
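Concretely, the every-N-requests variant plus a loaded-check before generating could look like this. The PeriodicReloader wrapper and its hooks are a hypothetical sketch; model_is_loaded, load_model(), and unload_model() follow CTranslate2's Translator/Generator API:

```python
import threading

class PeriodicReloader:
    """Reload the model every `every` requests instead of on each one.

    `model` is assumed to expose CTranslate2-style load_model(),
    unload_model(), and a model_is_loaded attribute.
    """

    def __init__(self, model, every=10):
        self.model = model
        self.every = every
        self._count = 0
        self._lock = threading.Lock()

    def before_generate(self):
        # Guard against generating while a background reload left the
        # model unloaded.
        with self._lock:
            if not self.model.model_is_loaded:
                self.model.load_model()

    def after_generate(self):
        # Schedule this from e.g. FastAPI's BackgroundTasks so the
        # response is returned before the reload happens.
        with self._lock:
            self._count += 1
            if self._count % self.every == 0:
                self.model.unload_model()
                self.model.load_model()
```

The lock keeps the reload and the loaded-check from interleaving across the request thread and the background-task thread.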
That's a big "if"...
There's clearly a memory leak somewhere, be it in CTranslate2 or, more likely, in OP's application code.
Could be related. https://github.com/SYSTRAN/faster-whisper/issues/660