DocsGPT
Cache local model to reduce memory usage and delays
- What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
Store the LLM after creation for future reuse. This is especially important for local LLMs: previously, the LLM was recreated on every request, which caused long delays and doubled GPU memory consumption.
- Why was this change needed? (You can also link to an open issue here)
Fixes https://github.com/arc53/DocsGPT/issues/945
- Other information:
Sorry for the delay on this PR. I really appreciate it! The main reason is that it's a relatively large conceptual change. Personally, I think there should be no LLM inference in our API container; ideally it should be a separate one. One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I ran into multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.
I think both the cache and singleton patterns are good, but I would prefer the singleton pattern for its simpler nature. Maybe also because I'm more used to it.
Is there any advantage or disadvantage to either one that you see?
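The singleton pattern described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the DocsGPT implementation; the class and method names (`EmbeddingsSingleton`, `get_instance`, `_load`) are made up for the example:

```python
import threading

class EmbeddingsSingleton:
    """Process-wide holder for a single embeddings model instance."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls, model_name):
        # Double-checked locking so concurrent requests share one model
        # instead of each loading their own copy into memory.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls._load(model_name)
        return cls._instance

    @staticmethod
    def _load(model_name):
        # Placeholder for the real (expensive) model construction.
        return {"model": model_name}

a = EmbeddingsSingleton.get_instance("all-MiniLM-L6-v2")
b = EmbeddingsSingleton.get_instance("all-MiniLM-L6-v2")
assert a is b  # the model is built once and reused across calls
```

Because the instance lives at class level, it is reachable from anywhere in the API without passing a Flask app or cache handle around, which is one of the simplicity arguments for the singleton.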
No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm also not a Python expert. I could rework the PR to use a singleton instead of a cache, I just wouldn't have as much time to test it anymore.
Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton. Also, thank you for bringing this up, as it surfaced some other issues in our application. We definitely need to change things. If you can rework it as a singleton, that would be amazing!
I'm glad to hear it helped!