Update server.cpp example with correct startup sequence
The HTTP listener start and the health API endpoint are moved before model loading begins, so the server can correctly report that it is loading the model.
In any case, it would be more logical to call ctx_server.load_model(params) only after all endpoints are registered. Additionally, we can add a middleware that returns 503 if the model is not yet loaded.
Binding them before doesn't work; the model must be loaded first. They can be bound afterwards without issues. There's really no reason to use the other endpoints while the server reports that the model is still being loaded. But indeed, I hadn't considered that 404 is not the right answer. I made this for ollama, which doesn't use any other endpoint.
I will amend it by registering the other endpoints with a static 503 answer before listening, and re-registering them once the model is loaded.
Furthermore, this change requires the main thread to call svr to register new endpoints after the server has been spawned into a new thread. This makes the use of svr not thread-safe.
You are right, I didn't check this. I will try to make it work without re-registering the endpoints at all.
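The ordering discussed above can be sketched without any HTTP library. This is only an illustration under stated assumptions: demo_server, register_routes and run_startup are hypothetical names, the route table stands in for svr, and flipping an atomic flag stands in for the completion of ctx_server.load_model(params). The point is that the route table is written only before the listener thread starts, so the only shared mutable state afterwards is the atomic flag.

```cpp
#include <atomic>
#include <functional>
#include <map>
#include <string>
#include <thread>
#include <utility>

// Hypothetical state flag; the names are assumptions for illustration.
enum class server_state { LOADING_MODEL, READY };

// Hypothetical stand-in for the HTTP server: a route table that is filled
// once and only read afterwards.
struct demo_server {
    std::map<std::string, std::function<int()>> routes; // handlers return an HTTP status
    std::atomic<server_state> state{server_state::LOADING_MODEL};

    int dispatch(const std::string & path) const {
        auto it = routes.find(path);
        return it == routes.end() ? 404 : it->second();
    }
};

// Register every endpoint up front; handlers consult the atomic flag at call time.
inline void register_routes(demo_server & srv) {
    auto gated = [&srv] { return srv.state.load() == server_state::READY ? 200 : 503; };
    srv.routes["/health"]     = gated;
    srv.routes["/completion"] = gated;
}

// Returns the /health status seen while "loading" and after the load finished.
inline std::pair<int, int> run_startup() {
    demo_server srv;
    register_routes(srv);                        // 1. all routes exist before listening

    std::atomic<int>  seen_while_loading{0};
    std::atomic<bool> polled{false};
    std::thread listener([&] {                   // 2. listener thread only reads the routes
        seen_while_loading = srv.dispatch("/health");
        polled = true;
    });

    while (!polled) std::this_thread::yield();   // the "model load" outlasts the first poll
    srv.state.store(server_state::READY);        // 3. load done: flip the flag, touch no routes
    listener.join();

    return {seen_while_loading.load(), srv.dispatch("/health")};
}
```

With this layout nothing is re-registered after the listener thread is running, which is the property the review comment asks for.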
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 468 iterations 🚀
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=10191.32ms p(95)=27347.77ms fails=, finish reason: stop=422 truncated=46
- Prompt processing (pp): avg=113.71tk/s p(95)=501.27tk/s
- Token generation (tg): avg=24.53tk/s p(95)=38.08tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=mannix-server-startup commit=942f023930ee7b5877034de52fabd3e67aed3589
[Chart: llamacpp:prompt_tokens_seconds over time (1713596946 --> 1713597588), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 468 iterations]
[Chart: llamacpp:predicted_tokens_seconds over time (1713596946 --> 1713597588), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 468 iterations]
[Chart: llamacpp:kv_cache_usage_ratio over time (1713596946 --> 1713597588), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 468 iterations]
[Chart: llamacpp:requests_processing over time (1713596946 --> 1713597588), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 468 iterations]
> Binding them before doesn't work, the model must be loaded.
The reference to ctx_server never changes before or after the model is loaded. If you see errors, that may be because something accesses ctx_server before the model is loaded (for example, checking the chat template). You should move all such accesses below the code block where all endpoints are registered and the HTTP server is listening.
Another thing to add: inside svr->set_pre_routing_handler, there should be a middleware that checks whether the request targets an endpoint other than /health. If it does and the model is not loaded yet, the middleware must return a 503 error.
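A minimal sketch of the decision such a middleware would make, kept free of any HTTP library so it can be read in isolation. The names server_state, gate_result and gate_request are hypothetical and not taken from server.cpp; in the real server the result would presumably be translated into the response status and the handler's return value inside the pre-routing handler.

```cpp
#include <atomic>
#include <string>

// Hypothetical state flag; the names are assumptions for illustration.
enum class server_state { LOADING_MODEL, READY };

struct gate_result {
    bool handled; // true: answer immediately and skip the normal route handlers
    int  status;  // HTTP status to send when handled is true
};

// Decide whether a request may reach its route handler.
// /health always passes through so it can report the loading state itself.
inline gate_result gate_request(const std::atomic<server_state> & state,
                                const std::string & path) {
    if (state.load() == server_state::READY) {
        return {false, 0};   // model loaded: let routing proceed
    }
    if (path == "/health") {
        return {false, 0};   // health check is allowed while loading
    }
    return {true, 503};      // everything else: Service Unavailable
}
```

Because the check runs on every request and reads only an atomic flag, no endpoint has to be registered twice or replaced once loading finishes.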
@ngxson
Can you please check it again and let me know?
The empty_json_model() is just a leftover of course.
What I'm wondering is if the middleware should report error 500 also for the static pages.
Thanks!
> What I'm wondering is if the middleware should report error 500 also for the static pages
If we can return the static pages while the model is loading, this is fine.
@ngxson Seems to work, let me know. Thanks a lot!