
Added llama-cpp-python support for local inference.

Open · rajakumar05032000 opened this pull request 11 months ago · 5 comments

- Allows running local models, eliminating the need for an Ollama server.
- Made the necessary configuration changes in the config.toml file.
- Used the Singleton design pattern for model creation.
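
A minimal sketch of what a singleton wrapper around llama-cpp-python could look like, assuming a `model_path` value read from `config.toml`; the class name, method names, and defaults here are illustrative, not the PR's actual code:

```python
# Hypothetical singleton wrapper around llama-cpp-python (names are illustrative).
from llama_cpp import Llama


class LlamaCppModel:
    _instance = None

    def __new__(cls, model_path: str, n_ctx: int = 2048):
        # Load the GGUF weights once and reuse the same instance everywhere,
        # so repeated calls don't re-read the model from disk.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.llm = Llama(model_path=model_path, n_ctx=n_ctx)
        return cls._instance

    def inference(self, prompt: str, max_tokens: int = 512) -> str:
        result = self.llm(prompt, max_tokens=max_tokens)
        return result["choices"][0]["text"]
```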

rajakumar05032000 · Mar 23 '24 06:03

@rajakumar05032000 Would it be possible to list and specify which LLM model is being served from llama.cpp? It should be listed/choosable from the drop-down.

mufeedvh · Mar 26 '24 21:03

> @rajakumar05032000 Would it be possible to list and specify which LLM model is being served from llama.cpp? It should be listed/choosable from the drop-down.

Currently it just shows a generic "LlamaCpp" entry in the drop-down; it doesn't show the specific model name. I'll make the changes you mentioned so the specific llama.cpp model can be listed and chosen from the drop-down.

Thanks for sharing your feedback!
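
One possible way to surface the specific model name in the drop-down would be to derive the label from the configured model path; this is a sketch of the idea, not the implementation in this PR:

```python
# Hypothetical helper: derive a drop-down label such as
# "LlamaCpp (mistral-7b-instruct.Q4_K_M.gguf)" instead of a generic "LlamaCpp".
import os


def llama_cpp_display_name(model_path: str) -> str:
    model_file = os.path.basename(model_path)  # e.g. the path read from config.toml
    return f"LlamaCpp ({model_file})"
```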

rajakumar05032000 · Mar 27 '24 05:03

Hey, so what's the update on this PR?

ARajgor · Apr 05 '24 18:04

Can anyone resolve the conflicts?

ARajgor · Apr 17 '24 08:04

This is a little late, but doesn't it make more sense from an architecture point of view to run the LLM models outside the server and connect to the LLM-serving process via a port, as is done with Ollama and other clients? The server could still spawn the external LLM-serving process (Docker may be easier than a plain process, since all dependencies are included), or the user could point it to an external ip:port.

I'm also looking into adding other hosted backends, like TGI, that have dynamic batching and can process many concurrent requests. Was this planned anywhere?
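
For context, a rough sketch of that ip:port approach against the OpenAI-compatible server that llama-cpp-python already ships; the endpoint, host, port, and helper name here are assumptions:

```python
# Hypothetical client for an externally running llama.cpp server
# (llama-cpp-python ships one: `python -m llama_cpp.server --model <path>`),
# which exposes an OpenAI-compatible HTTP API. Host/port are assumptions.
import requests


def remote_completion(prompt: str, base_url: str = "http://localhost:8000") -> str:
    response = requests.post(
        f"{base_url}/v1/completions",
        json={"prompt": prompt, "max_tokens": 256},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
```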

rjanovski · Apr 29 '24 13:04