Feature request: allow loading/unloading models on the server
I extracted this discussion from https://github.com/ggml-org/llama.cpp/issues/13367 , mainly to better plan the tasks around it.
Allow loading / unloading models via the API: in server.cpp, we can add a kind of "super" main() function that wraps around the current main(). The new main() will spawn an "interim" HTTP server that exposes the API to load a model. Of course, this functionality will be restricted to local deployments to avoid any security issues.
This idea was demoed in https://github.com/ggml-org/llama.cpp/pull/13400 , but the implementation is still far from usable. It actually requires a refactoring of the server.
While alternative methods for hot-swapping models already exist, I think refactoring the server.cpp code can still benefit long-term development quite a lot. Therefore, this feature could be a suitable goal for the refactoring effort.
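To make the shape of the proposal concrete, here is a minimal, hedged sketch of such an "interim" router, assuming cpp-httplib (which llama.cpp already bundles for its server); the endpoint path, payload handling, and the spawn_model_instance() helper are hypothetical placeholders, not the actual server.cpp design:

```cpp
// Sketch only: a local-only "router" main() that exposes a load endpoint
// and hands off to the existing server logic. Everything below is a
// simplified placeholder, not the real server.cpp implementation.
#include "httplib.h"
#include <string>

// Hypothetical helper: a real implementation would fork/exec (or otherwise
// launch) a llama-server instance for the requested model.
static bool spawn_model_instance(const std::string & model_path, int port) {
    (void) model_path; (void) port;
    return false; // stub
}

int main() {
    httplib::Server router;

    // Hypothetical endpoint; the model path is taken from the raw body here,
    // while a real version would parse and validate a JSON payload.
    router.Post("/models/load", [](const httplib::Request & req, httplib::Response & res) {
        const int port = 35891; // placeholder: a real router would pick a free port
        if (spawn_model_instance(req.body, port)) {
            res.set_content("{\"status\":\"loading\"}", "application/json");
        } else {
            res.status = 500;
            res.set_content("{\"error\":\"failed to spawn server instance\"}", "application/json");
        }
    });

    // Bind to localhost only, since process control must not be exposed remotely.
    router.listen("127.0.0.1", 8080);
    return 0;
}
```

The essential property is that the router process itself never loads a model; it only brokers per-model server instances (matching the "no model will be loaded in this process" log line later in this thread).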
@ngxson -- thank you for working on this! I currently use llama-swap for this functionality, but having taken a cursory look at the implementation, my sense is that it could be far better implemented in llama.cpp.
Minimal UI rework to add an OpenAI-compatible JSON payload for the model selector, compatible with both llama-swap and current or future multi-model server integrations: https://github.com/ggml-org/llama.cpp/pull/16562
https://github.com/user-attachments/assets/0fc0fc66-04de-42ae-ae7e-78638ec823f0
Why don't we just emulate this behavior from Ollama? Loading and unloading models is extremely fast and convenient in Ollama compared to llama-swap. I use Ollama only for this convenience.
I built a minimal “llama-swap” to test the real lower bound of model-swap speed in llama-server. It still uses callbacks for process control and SSE chunks, but there's no artificial delay: it already runs near the hardware limit of loading weights from SSD to RAM/VRAM. Ollama doesn't avoid that either; a backend refactor might shave off a few milliseconds, as Ollama does, but I'm not sure that would even be perceptible.
https://github.com/user-attachments/assets/349256c0-64d9-49e9-b783-61de83c57014
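For the "callbacks for SSE chunks" part mentioned above, here is a hedged sketch of relaying streamed chunks with cpp-httplib's chunked content provider; next_chunk_from_upstream() is a hypothetical stand-in for reading from the spawned llama-server instance, not part of any real code in the linked demo:

```cpp
// Sketch only: relay SSE chunks from an upstream (spawned) llama-server to
// the client as they arrive, without buffering the whole response.
#include "httplib.h"
#include <string>

// Hypothetical helper: returns the next SSE chunk from the upstream
// instance, or an empty string when the stream is finished. Stubbed here
// so the sketch compiles; a real version would read from the child
// process or from an HTTP connection to it.
static std::string next_chunk_from_upstream() {
    static bool sent = false;
    if (sent) return "";
    sent = true;
    return "data: {\"choices\":[]}\n\n";
}

static void handle_completion(const httplib::Request &, httplib::Response & res) {
    res.set_chunked_content_provider("text/event-stream",
        [](size_t /*offset*/, httplib::DataSink & sink) {
            const std::string chunk = next_chunk_from_upstream();
            if (chunk.empty()) {
                sink.done();  // end of stream
                return false; // stop the provider
            }
            sink.write(chunk.data(), chunk.size());
            return true;      // keep streaming
        });
}

int main() {
    httplib::Server proxy;
    proxy.Post("/v1/chat/completions", handle_completion);
    proxy.listen("127.0.0.1", 8081);
    return 0;
}
```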
Some people may benefit from the methods in https://github.com/ggml-org/llama.cpp/issues/8796 for speeding up disk read performance.
@ngxson -- thank you for #17470 -- excited to see the merge.
Two questions:
- I keep getting this error: srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}. Do you have any suggestions on how to debug? I've tried a few models with --models-dir, but I get the same result. Happy to open an issue if that makes more sense.
- I'll have to look into using the API to set the per-model configuration, but it would be great if there were some way to have llama-server do this, perhaps via a file. In one use case, end users won't know what the optimal parameters are. With llama-swap I can default to the recommended ones. Happy to open an issue for this, too.
@dwrz For (1), is there anything shown in the logs? For (2), it will be added in the next version. I plan to add it in a dedicated PR so maintainers can review it more easily and to reduce the chance of security issues slipping through.
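For reference, a hedged sketch of driving the router endpoints that appear in the logs below (GET /models and POST /models/load) from C++ with cpp-httplib; the JSON body shape ({"model": ...}) is an assumption, so check the PR for the actual payload:

```cpp
// Sketch only: query the experimental router API from a client process.
#include "httplib.h"
#include <iostream>

int main() {
    httplib::Client cli("127.0.0.1", 8080);

    // List models known to the router (seen as GET /models in the logs).
    if (auto res = cli.Get("/models")) {
        std::cout << "models: " << res->body << "\n";
    }

    // Ask the router to spawn an instance for one of them.
    // The payload field name is an assumption for illustration only.
    auto res = cli.Post("/models/load",
                        R"({"model": "gemma-3-27b-it-ud-q8_k_xl"})",
                        "application/json");
    if (res) {
        std::cout << res->status << " " << res->body << "\n";
    }
    return 0;
}
```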
Thank you! The logs show the llama-server command it will run, but there are no other relevant log lines for the exception.
Maybe you can post the relevant log lines (multiple lines) here?
Also make sure you started the server with -v to observe the debug logs.
init: using 15 threads for HTTP server
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8080
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv load: spawning server instance with name=gemma-3-27b-it-ud-q8_k_xl on port 35891
srv load: spawning server instance with args:
srv load: llama-server
srv load: --models-dir
srv load: /home/llama-swap/models
srv load: -m
srv load: /home/llama-swap/models/gemma-3-27b-it-ud-q8_k_xl.gguf
srv load: --port
srv load: 35891
srv load: --alias
srv load: gemma-3-27b-it-ud-q8_k_xl
srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}
srv log_server_r: request: POST /models/load 127.0.0.1 500
With typical flags:
init: using 15 threads for HTTP server
main: starting router server, no model will be loaded in this process
start: binding port with default address family
main: router server is listening on http://127.0.0.1:8080
main: NOTE: router mode is experimental
main: it is not recommended to use this mode in untrusted environments
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv log_server_r: request: GET /models 127.0.0.1 200
srv load: spawning server instance with name=gemma-3-27b-it-ud-q8_k_xl on port 44041
srv load: spawning server instance with args:
srv load: llama-server
srv load: --models-dir
srv load: /home/llama-swap/models
srv load: --ctx-size
srv load: 0
srv load: --gpu-layers
srv load: 888
srv load: --jinja
srv load: --min-p
srv load: 0.00
srv load: --repeat-penalty
srv load: 1.0
srv load: --temp
srv load: 1.0
srv load: --top-k
srv load: 64
srv load: --top-p
srv load: 0.95
srv load: -m
srv load: /home/llama-swap/models/gemma-3-27b-it-ud-q8_k_xl.gguf
srv load: --port
srv load: 44041
srv load: --alias
srv load: gemma-3-27b-it-ud-q8_k_xl
srv operator(): got exception: {"error":{"code":500,"message":"failed to spawn server instance","type":"server_error"}}
srv log_server_r: request: POST /models/load 127.0.0.1 500
I'll take a look with the verbose flag.
Update: no other insights with that flag. I see a JSON response when listing the models from the web UI, but nothing more about why the server instance failed to spawn.
@dwrz can you try https://github.com/ggml-org/llama.cpp/pull/17669 to see if it resolves the problem?
By the way, out of curiosity, how did you install llama-server on your system?
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
#17669 looks to be working for me! I'll try using an mmproj next.
Working with mmproj -- thank you, @ngxson!