Cache local model to reduce memory usage and delays

Open · starkgate opened this pull request 1 year ago • 6 comments

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Store the LLM after creation and reuse it for subsequent requests. This is especially necessary for local LLMs: previously the LLM was recreated for each request, which caused long delays and doubled GPU memory consumption. (A sketch of the idea is included under "Other information" below.)

  • Why was this change needed? (You can also link to an open issue here)

Fixes https://github.com/arc53/DocsGPT/issues/945

  • Other information:
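For illustration, here is a minimal sketch of the idea, not the actual diff from this PR; `create_llm` and the module layout are placeholders rather than DocsGPT's real code:

```python
# llm_cache.py -- illustrative only; create_llm stands in for whatever
# factory the app uses to construct a local LLM.

def create_llm(model_name: str):
    # Placeholder for the expensive step that loads model weights onto
    # the GPU. Doing this on every request is what caused the long
    # delays and doubled GPU memory consumption described above.
    return object()

_llm_instances: dict = {}

def get_llm(model_name: str):
    """Return the cached LLM for model_name, creating it only once."""
    if model_name not in _llm_instances:
        _llm_instances[model_name] = create_llm(model_name)
    return _llm_instances[model_name]
```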

starkgate avatar May 28 '24 08:05 starkgate

Someone is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar May 28 '24 08:05 vercel[bot]

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs-gpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 14, 2024 0:04am |

vercel[bot] avatar May 28 '24 13:05 vercel[bot]

Sorry for the delay on this PR. Really appreciate it! The main reason is that it's a relatively large conceptual change. Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one. One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.

PR for embeddings

I think both the cache and singleton patterns are good, but I would prefer the singleton pattern because it's simpler. Maybe also because I'm more used to it.

Is there any advantage or disadvantage to either one that you see?
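For context, here is a rough sketch of the singleton approach described above; the class and names are illustrative, not DocsGPT's actual embeddings code:

```python
# Illustrative singleton for an embeddings model: built lazily, once per
# process, and shared by all callers.
import threading

class EmbeddingsSingleton:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    # Placeholder for the real embedding-model construction.
                    cls._instance = object()
        return cls._instance
```

The double-checked locking keeps a second thread from building a second copy of the model while the first one is still loading.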

dartpain avatar Jun 14 '24 12:06 dartpain

No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm also not a Python expert. I could rework the PR to use a singleton instead of the cache, I just wouldn't have as much time to test it anymore.
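For reference, a rough sketch of the Flask-Caching variant described here; the cache key and `create_llm` factory are placeholders:

```python
# Illustrative Flask-Caching approach: the cache object is importable
# anywhere in the API and acts as the "global variable" mentioned above.
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

def create_llm():
    # Placeholder for the expensive local-LLM construction.
    return object()

def get_llm():
    llm = cache.get("local_llm")
    if llm is None:
        llm = create_llm()
        cache.set("local_llm", llm, timeout=0)  # timeout=0: never expire
    return llm
```

Note that holding a live model object only works with an in-process backend such as SimpleCache; a serializing backend like Redis could not store it, which is one practical argument for the plain singleton.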

starkgate avatar Jun 17 '24 07:06 starkgate

Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton. Also, thank you for bringing this up, as it surfaced some other issues in our application; we definitely need to change things. If you can rework it as a singleton, that would be amazing!

dartpain avatar Jun 18 '24 10:06 dartpain

I'm glad to hear it helped!

starkgate avatar Jun 18 '24 11:06 starkgate