Required configuration to run Llama.cpp directly
I would like to configure Llama.cpp directly with Ollama, but I can't figure out how to do it.
As far as I understand, Llama.cpp has a way to serve OpenAI-compatible endpoints.
Is there any way to run it directly in Olla?
Ah, good timing - we've put llamacpp support back but haven't created a new release yet.
You can start llamacpp (or a fork of it) with something like the following - we build from source, so the executable path may differ for you:
./build/bin/llama-server \
--model /mnt/storage/models/openai_gpt-oss-120b/ggml-model-f16.gguf \
--host 0.0.0.0 \
--port 8001 \
--ctx-size 4096 \
--n-gpu-layers -1 \
--threads $(nproc) \
--mlock \
--no-mmap
Then curl the /v1/chat/completions endpoint:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
That PR is merged but we're testing internally (you can build the current main and try it now if you wish though).
See the pre-release documentation at: https://thushan.github.io/olla/integrations/backend/llamacpp/
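To give a rough idea of the shape of it: you register your llama-server URL as a static endpoint in Olla's config.yaml, something like the sketch below. The keys follow the static-endpoint examples in the Olla README and the type/profile name here is an assumption on my part, so treat the docs above as the source of truth:

discovery:
  static:
    endpoints:
      - name: local-llamacpp
        url: http://localhost:8001   # the llama-server started above
        type: llamacpp               # assumption - check the docs above for the exact profile name
        priority: 100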
But it will be v0.0.20 (or we might finally move up to v0.1.0), because it now supports llamacpp properly again (though without the management APIs, which we descoped).
Let me know your thoughts or any challenges you come across. I still have a couple more fixes to make, so the target is the 20th or 23rd of October.
I see. Will the llama.cpp binary be integrated into Olla, or should it be called from another container (for example)?
Could you give an example with Docker Compose?
Good news: v0.0.20 is out now, so you can grab the latest Olla from ghcr for your Podman/Docker setup.
Olla isn't really designed to run a backend itself; it's a bit like nginx, sitting between you and the backend you need served.
Could you give an example with Docker Compose?
There's a small example I did, but using llamacpp is very model-dependent, so I wouldn't be able to do a generic one.
https://github.com/thushan/olla/tree/v0.0.20/examples/claude-code-llamacpp
But it gives the compose layout at least.
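As a rough starting point, the layout boils down to two services - llama-server and Olla in front of it - along the lines of the sketch below. The image tags, ports, model path and config mount are placeholders/assumptions on my part (and GPU wiring is omitted), so adapt it from the linked example rather than copying it verbatim:

services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server   # or point this at your own build
    command: >
      --model /models/ggml-model-f16.gguf
      --host 0.0.0.0
      --port 8001
      --ctx-size 4096
    volumes:
      - /mnt/storage/models/openai_gpt-oss-120b:/models:ro
    ports:
      - "8001:8001"

  olla:
    image: ghcr.io/thushan/olla:latest          # image name assumed from "grab it from ghcr"
    volumes:
      - ./config.yaml:/app/config.yaml:ro       # config.yaml registers http://llamacpp:8001 as an endpoint
    ports:
      - "40114:40114"                           # adjust to whatever port you configure Olla on
    depends_on:
      - llamacpp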
Hope that helps.
Excuse me, I have a question. Is it possible to use pull or run in Olla (as if it were Ollama)?
Unfortunately not. Olla is just a proxy for serving, not for managing models; because of the way routing and load balancing work, it's best-effort delivery.