devika
Added llama-cpp-python support for local inference.
-> It allows running local models, eliminating the need for an Ollama server.
-> Made the necessary configuration changes in the config.toml file.
-> Used the Singleton design pattern for model creation.
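For context, a minimal sketch of what a singleton wrapper around llama-cpp-python could look like. The config section and key names (`LLAMACPP`, `MODEL_PATH`) are placeholders, not necessarily the ones used in this PR:

```python
# Hedged sketch of a singleton llama-cpp-python wrapper; not the PR's exact code.
# Config keys ("LLAMACPP", "MODEL_PATH") are illustrative assumptions.
import threading
import tomllib  # Python 3.11+; older versions can use the third-party "toml" package

from llama_cpp import Llama


class LlamaCppModel:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls, config_path: str = "config.toml"):
        # Load the GGUF model only once; subsequent constructions reuse the instance.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                with open(config_path, "rb") as f:
                    config = tomllib.load(f)
                cls._instance.model = Llama(
                    model_path=config["LLAMACPP"]["MODEL_PATH"],
                    n_ctx=4096,
                )
        return cls._instance

    def inference(self, prompt: str) -> str:
        # Plain completion call; real code would likely stream and tune sampling params.
        result = self.model(prompt, max_tokens=512)
        return result["choices"][0]["text"]
```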
@rajakumar05032000 Would it be possible to list and specify which LLM model is being served from llama.cpp? It should be listed/choosable from the drop-down.
Currently it just shows "LlamaCpp" in the drop-down; it doesn't show the specific model name. I'll make the necessary changes as you mentioned to list and choose a specific llama.cpp model from the drop-down.
Thanks for sharing your feedback!
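One possible way to surface specific model names in the drop-down would be to scan the local models directory for GGUF files and return their names to the frontend. A rough sketch, assuming a configurable models directory (the path and function name below are hypothetical):

```python
# Hedged sketch: list GGUF files so the UI drop-down can show real model names.
# The models directory path is an assumption, not devika's actual config value.
from pathlib import Path


def list_llamacpp_models(models_dir: str = "~/models") -> list[str]:
    """Return GGUF file stems found in models_dir for display in the drop-down."""
    base = Path(models_dir).expanduser()
    if not base.is_dir():
        return []
    return sorted(p.stem for p in base.glob("*.gguf"))


# Example output: ["mistral-7b-instruct-v0.2.Q4_K_M", "phi-2.Q5_K_S"]
print(list_llamacpp_models())
```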
Hey, so what's the update on this PR?
Can anyone resolve the conflicts?
This is a little late, but doesn't it make more sense from an architecture POV to run the LLM models outside the server and connect to the LLM serving process via a port, like is done with Ollama and other clients? The server could still spawn the external LLM serving process (Docker may be easier than a plain process, since all dependencies are included), or the user could point to an external ip:port.

I'm also looking into adding other hosted backends, like TGI, that have dynamic batching and can process many concurrent requests. Was this planned anywhere?
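To illustrate the suggestion, here is a minimal sketch of that alternative architecture: the model runs in a separate llama.cpp server process (e.g. `llama-server -m model.gguf --port 8080`) and devika talks to it over HTTP, the same way it talks to Ollama. The endpoint follows llama.cpp's OpenAI-compatible API; the host, port, and function name are assumptions for illustration:

```python
# Hedged sketch: client for an external llama.cpp server instead of in-process inference.
# Assumes a llama.cpp server with the OpenAI-compatible API on 127.0.0.1:8080.
import requests


def chat(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello from an external llama.cpp server"))
```

This keeps the model's memory footprint and dependencies out of the devika server process, and swapping in another HTTP backend (Ollama, TGI, etc.) becomes a matter of changing the base URL and payload shape.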