
Update LLM docs to recommend different local solutions

Open simonw opened this issue 6 months ago • 1 comment

These days, I think llm-ollama and llm-llama-server are the best options for local models for most people.

Mainly because they run as separate processes, which means the model stays loaded in between llm calls.

They are also really easy to install! Ollama has an installer and llama-server can be had from Homebrew.
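
For illustration, here is a minimal sketch of what "stays loaded" buys you: llama-server exposes an OpenAI-compatible HTTP API (port 8080 by default), so each request hits the already-running process instead of reloading the weights. The port and endpoint here are just the llama-server defaults; adjust for your setup.

```python
# Minimal sketch: talk to a llama-server instance that is already running.
# Assumes the default port 8080 and the OpenAI-compatible
# /v1/chat/completions endpoint that llama-server exposes.
import json
import urllib.request


def ask(prompt, base_url="http://localhost:8080"):
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # The weights stay resident in the server process, so repeated calls
    # like this (or repeated llm invocations) skip the model load time.
    print(ask("Say hello in one sentence."))
```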

simonw avatar May 29 '25 01:05 simonw

Some random ideas for llm-llama-server:

  • As far as I can tell, it can only serve one model at a time, but you can run multiple instances on different ports. It would be good if there were a convenient way to get different model IDs for different ports somehow (see the registration sketch after this list)
  • Shipping this would make llm-llama-server a lot more compelling: #1117
  • llm-llamafile is a separate plugin at the moment but it should be retired in favor of llm-llama-server
  • It might be good to have an llm llama-server add ... command of some sort for registering additional ports; I'm not sure yet what that should look like
  • How about an llm llama-server start command which starts it running for you? (sketched at the end of this comment)
  • A wildly ambitious solution would be to bundle the binary in a bunch of wheels… most of the releases on https://github.com/ggml-org/llama.cpp/releases/tag/b5527 would easily fit in a 100 MB wheel, with the exception of the CUDA ones
  • If we do that, it should not be a required dependency of the plugin, because I imagine a lot of users will be happy to download and run it separately or use Homebrew.
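
Here is a rough sketch of the "different model IDs for different ports" idea, using llm's register_models plugin hook. The config file name, its format, and the model-ID scheme are all made up for this example; this is not how llm-llama-server currently works.

```python
# Sketch only: map llama-server ports to distinct llm model IDs.
# Hypothetical config, e.g. {"8080": "llama-server-8b", "8081": "llama-server-70b"}
import json
import urllib.request
from pathlib import Path

import llm

CONFIG_PATH = Path(llm.user_dir()) / "llama-server-ports.json"


class LlamaServer(llm.Model):
    can_stream = False

    def __init__(self, model_id, port):
        self.model_id = model_id
        self.port = port

    def execute(self, prompt, stream, response, conversation):
        # Forward the prompt to the llama-server instance on this port.
        payload = json.dumps({
            "messages": [{"role": "user", "content": prompt.prompt}],
        }).encode("utf-8")
        request = urllib.request.Request(
            f"http://localhost:{self.port}/v1/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as http_response:
            body = json.load(http_response)
        yield body["choices"][0]["message"]["content"]


@llm.hookimpl
def register_models(register):
    # Each configured port becomes its own model ID, so e.g.
    # `llm -m llama-server-70b "..."` routes to the right instance.
    if not CONFIG_PATH.exists():
        return
    for port, model_id in json.loads(CONFIG_PATH.read_text()).items():
        register(LlamaServer(model_id, port))
```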

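The llm llama-server start idea could hang off llm's register_commands hook. Here is a rough sketch, with the flag names and the PATH-lookup strategy as assumptions rather than a real design; usage would then look something like llm llama-server start ./model.gguf --port 8081.

```python
# Sketch only: an `llm llama-server start` command that launches a
# llama-server binary found on PATH. A real implementation would need
# to handle backgrounding, logging, multiple ports, etc.
import shutil
import subprocess

import click
import llm


@llm.hookimpl
def register_commands(cli):
    @cli.group(name="llama-server")
    def llama_server():
        "Commands for managing llama-server instances"

    @llama_server.command(name="start")
    @click.argument("gguf_path")
    @click.option("--port", default=8080, help="Port to serve on")
    def start(gguf_path, port):
        "Start llama-server for a GGUF model on the given port"
        binary = shutil.which("llama-server")
        if binary is None:
            raise click.ClickException(
                "llama-server not found on PATH (try: brew install llama.cpp)"
            )
        # Runs in the foreground for simplicity; Ctrl-C stops the server.
        subprocess.run([binary, "-m", gguf_path, "--port", str(port)])
```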
simonw avatar May 29 '25 01:05 simonw