Feature request: allow loading/unloading models on the server
I extracted this discussion from https://github.com/ggml-org/llama.cpp/issues/13367 , mainly to better plan the tasks around it.
Allow loading / unloading models via the API: in server.cpp, we can add a kind of "super" main() function that wraps around the current main(). The new main() will spawn an "interim" HTTP server that exposes the API to load a model. Of course, this functionality will be restricted to local deployments to avoid any security issues.
This idea was demoed in https://github.com/ggml-org/llama.cpp/pull/13400 , but the implementation is still far from usable. It actually requires a refactoring of the server.
While alternative methods for hot-swapping models already exist, I think refactoring the server.cpp code can still benefit long-term development quite a lot. Therefore, this feature could be a suitable goal for the refactoring effort.
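To make the shape of the proposal concrete, here is a minimal, hedged sketch of such an "interim" router, assuming cpp-httplib (which llama.cpp already bundles for its server); the endpoint path, payload handling, and the spawn_model_instance() helper are hypothetical placeholders, not the actual server.cpp design:

```cpp
// Sketch only: a local-only "router" main() that exposes a load endpoint
// and hands off to the existing server logic. Everything below is a
// simplified placeholder, not the real server.cpp implementation.
#include "httplib.h"
#include <string>

// Hypothetical helper: a real implementation would fork/exec (or otherwise
// launch) a llama-server instance for the requested model.
static bool spawn_model_instance(const std::string & model_path, int port) {
    (void) model_path; (void) port;
    return false; // stub
}

int main() {
    httplib::Server router;

    // Hypothetical endpoint; the model path is taken from the raw body here,
    // while a real version would parse and validate a JSON payload.
    router.Post("/models/load", [](const httplib::Request & req, httplib::Response & res) {
        const int port = 35891; // placeholder: a real router would pick a free port
        if (spawn_model_instance(req.body, port)) {
            res.set_content("{\"status\":\"loading\"}", "application/json");
        } else {
            res.status = 500;
            res.set_content("{\"error\":\"failed to spawn server instance\"}", "application/json");
        }
    });

    // Bind to localhost only, since process control must not be exposed remotely.
    router.listen("127.0.0.1", 8080);
    return 0;
}
```

The essential property is that the router process itself never loads a model; it only brokers per-model server instances (matching the "no model will be loaded in this process" log line later in this thread).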
@ngxson -- thank you for working on this! I currently use llama-swap for this functionality, but having taken a cursory look at the implementation, my sense is that it could be far better implemented in llama.cpp.
Minimal UI rework to add an OpenAI-compatible JSON payload for the model selector, compatible with both llama-swap and current or future multi-model server integrations: https://github.com/ggml-org/llama.cpp/pull/16562
https://github.com/user-attachments/assets/0fc0fc66-04de-42ae-ae7e-78638ec823f0
Why don't we just emulate this behavior from Ollama? Loading and unloading models is extremely fast and convenient in Ollama compared to llama-swap. I use Ollama only for this convenience.
I built a minimal “llama-swap” to test the real lower bound of model-swap speed in llama-server. It still uses callbacks for process control and SSE chunks, but there's no artificial delay: it already runs near the hardware limit of loading weights from SSD to RAM/VRAM. Ollama doesn't avoid that either; a backend refactor might shave off a few milliseconds, as Ollama does, but I'm not sure that would even be perceptible.
https://github.com/user-attachments/assets/349256c0-64d9-49e9-b783-61de83c57014
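For the "callbacks for SSE chunks" part mentioned above, here is a hedged sketch of relaying streamed chunks with cpp-httplib's chunked content provider; next_chunk_from_upstream() is a hypothetical stand-in for reading from the spawned llama-server instance, not part of any real code in the linked demo:

```cpp
// Sketch only: relay SSE chunks from an upstream (spawned) llama-server to
// the client as they arrive, without buffering the whole response.
#include "httplib.h"
#include <string>

// Hypothetical helper: returns the next SSE chunk from the upstream
// instance, or an empty string when the stream is finished. Stubbed here
// so the sketch compiles; a real version would read from the child
// process or from an HTTP connection to it.
static std::string next_chunk_from_upstream() {
    static bool sent = false;
    if (sent) return "";
    sent = true;
    return "data: {\"choices\":[]}\n\n";
}

static void handle_completion(const httplib::Request &, httplib::Response & res) {
    res.set_chunked_content_provider("text/event-stream",
        [](size_t /*offset*/, httplib::DataSink & sink) {
            const std::string chunk = next_chunk_from_upstream();
            if (chunk.empty()) {
                sink.done();  // end of stream
                return false; // stop the provider
            }
            sink.write(chunk.data(), chunk.size());
            return true;      // keep streaming
        });
}

int main() {
    httplib::Server proxy;
    proxy.Post("/v1/chat/completions", handle_completion);
    proxy.listen("127.0.0.1", 8081);
    return 0;
}
```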
Some people may benefit from the methods in https://github.com/ggml-org/llama.cpp/issues/8796 for speeding up disk read performance.
@ngxson -- thank you for #17470 -- excited to see the merge.
Two questions:
- I keep getting this error: srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}. Do you have any suggestions on how to debug? I've tried a few models with --models-dir, but I get the same result. Happy to open an issue if that makes more sense.
- I'll have to look into using the API to set the per-model configuration, but it would be great if there were some way to have llama-server do this, perhaps via a file. In one use case, end users won't know what the optimal parameters are. With llama-swap I can default to the recommended ones. Happy to open an issue for this, too.
@dwrz For (1), is there anything shown in the logs? For (2), it will be added in the next version. I plan to add it in a dedicated PR so maintainers can review it more easily and to reduce the chance of security issues slipping through.
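For reference, a hedged sketch of driving the router endpoints that appear in the logs below (GET /models and POST /models/load) from C++ with cpp-httplib; the JSON body shape ({"model": ...}) is an assumption, so check the PR for the actual payload:

```cpp
// Sketch only: query the experimental router API from a client process.
#include "httplib.h"
#include <iostream>

int main() {
    httplib::Client cli("127.0.0.1", 8080);

    // List models known to the router (seen as GET /models in the logs).
    if (auto res = cli.Get("/models")) {
        std::cout << "models: " << res->body << "\n";
    }

    // Ask the router to spawn an instance for one of them.
    // The payload field name is an assumption for illustration only.
    auto res = cli.Post("/models/load",
                        R"({"model": "gemma-3-27b-it-ud-q8_k_xl"})",
                        "application/json");
    if (res) {
        std::cout << res->status << " " << res->body << "\n";
    }
    return 0;
}
```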
Thank you! The logs show the llama-server command it will run, but there are no other relevant log lines for the exception.
Maybe you can post the relevant log lines (multiple lines) here?
Also make sure you started the server with -v to observe the debug logs.
init: using 15 threads for HTTP server
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8080
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv load: spawning server instance with name=gemma-3-27b-it-ud-q8_k_xl on port 35891
srv load: spawning server instance with args:
srv load: llama-server
srv load: --models-dir
srv load: /home/llama-swap/models
srv load: -m
srv load: /home/llama-swap/models/gemma-3-27b-it-ud-q8_k_xl.gguf
srv load: --port
srv load: 35891
srv load: --alias
srv load: gemma-3-27b-it-ud-q8_k_xl
srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}
srv log_server_r: request: POST /models/load 127.0.0.1 500
With typical flags:
init: using 15 threads for HTTP server
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8080
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv load: spawning server instance with name=gemma-3-27b-it-ud-q8_k_xl on port 44041
srv load: spawning server instance with args:
srv load: llama-server
srv load: --models-dir
srv load: /home/llama-swap/models
srv load: --ctx-size
srv load: 0
srv load: --gpu-layers
srv load: 888
srv load: --jinja
srv load: --min-p
srv load: 0.00
srv load: --repeat-penalty
srv load: 1.0
srv load: --temp
srv load: 1.0
srv load: --top-k
srv load: 64
srv load: --top-p
srv load: 0.95
srv load: -m
srv load: /home/llama-swap/models/gemma-3-27b-it-ud-q8_k_xl.gguf
srv load: --port
srv load: 44041
srv load: --alias
srv load: gemma-3-27b-it-ud-q8_k_xl
srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}
srv log_server_r: request: POST /models/load 127.0.0.1 500
I'll take a look with the verbose flag.
Update: no other insights with that flag. I see a JSON response when listing the models from the web UI, but nothing more about why the server instance failed to spawn.
@dwrz can you try https://github.com/ggml-org/llama.cpp/pull/17669 to see if it resolves the problem?
By the way, out of curiosity, how did you install llama-server on your system?
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
#17669 looks to be working for me! I'll try using an mmproj next.
Working with mmproj -- thank you, @ngxson!