Cache local model to reduce memory usage and delays

Open · starkgate opened this pull request 1 year ago • 6 comments

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Store the LLM after creation and reuse it for subsequent requests. This is especially necessary for local LLMs: previously the LLM was recreated for each request, which caused long delays and doubled GPU memory consumption. (A sketch of the idea is included under "Other information" below.)

  • Why was this change needed? (You can also link to an open issue here)

Fixes https://github.com/arc53/DocsGPT/issues/945

  • Other information:
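For illustration, here is a minimal sketch of the idea, not the actual diff from this PR; `create_llm` and the module layout are placeholders rather than DocsGPT's real code:

```python
# llm_cache.py -- illustrative only; create_llm stands in for whatever
# factory the app uses to construct a local LLM.

def create_llm(model_name: str):
    # Placeholder for the expensive step that loads model weights onto
    # the GPU. Doing this on every request is what caused the long
    # delays and doubled GPU memory consumption described above.
    return object()

_llm_instances: dict = {}

def get_llm(model_name: str):
    """Return the cached LLM for model_name, creating it only once."""
    if model_name not in _llm_instances:
        _llm_instances[model_name] = create_llm(model_name)
    return _llm_instances[model_name]
```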

starkgate avatar May 28 '24 08:05 starkgate

Someone is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar May 28 '24 08:05 vercel[bot]

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs-gpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 14, 2024 0:04am |

vercel[bot] avatar May 28 '24 13:05 vercel[bot]

Sorry for the delay on this PR. Really appreciate it! The main reason is that it's a relatively large conceptual change. Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one. One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.

PR for embeddings

I think both the cache and singleton patterns are good, but I would prefer the singleton pattern because it's simpler. Maybe also because I'm more used to it.

Is there any advantage or disadvantage to either one that you see?
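For context, here is a rough sketch of the singleton approach described above; the class and names are illustrative, not DocsGPT's actual embeddings code:

```python
# Illustrative singleton for an embeddings model: built lazily, once per
# process, and shared by all callers.
import threading

class EmbeddingsSingleton:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    # Placeholder for the real embedding-model construction.
                    cls._instance = object()
        return cls._instance
```

The double-checked locking keeps a second thread from building a second copy of the model while the first one is still loading.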

dartpain avatar Jun 14 '24 12:06 dartpain

No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm also not a Python expert. I could rework the PR to use a singleton instead of the cache, I just wouldn't have as much time to test it anymore.
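For reference, a rough sketch of the Flask-Caching variant described here; the cache key and `create_llm` factory are placeholders:

```python
# Illustrative Flask-Caching approach: the cache object is importable
# anywhere in the API and acts as the "global variable" mentioned above.
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

def create_llm():
    # Placeholder for the expensive local-LLM construction.
    return object()

def get_llm():
    llm = cache.get("local_llm")
    if llm is None:
        llm = create_llm()
        cache.set("local_llm", llm, timeout=0)  # timeout=0: never expire
    return llm
```

Note that holding a live model object only works with an in-process backend such as SimpleCache; a serializing backend like Redis could not store it, which is one practical argument for the plain singleton.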

starkgate avatar Jun 17 '24 07:06 starkgate

Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton. Also, thank you for bringing this up, as it surfaced some other issues in our application; we definitely need to change things. If you can rework it as a singleton, that would be amazing!

dartpain avatar Jun 18 '24 10:06 dartpain

I'm glad to hear it helped!

starkgate avatar Jun 18 '24 11:06 starkgate