Update server.cpp example with correct startup sequence

Open mann1x opened this issue 1 year ago • 7 comments

The HTTP listener start and the health API endpoint are moved before the model loading starts, so the server can correctly report that it is loading the model.
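As a rough illustration of what "correctly report" could mean, here is a minimal, standard-library-only sketch that maps a server state to a /health reply. The `ServerState` enum, `HealthReply` struct, `health_reply` function, and the exact JSON bodies are assumptions made for this example; they are not taken from server.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical server states; names are illustrative, not from server.cpp.
enum class ServerState { LoadingModel, Ready, Error };

struct HealthReply {
    int status;        // HTTP status code to send
    std::string body;  // JSON payload (assumed shape)
};

// Map the current state to what a /health handler could send back.
inline HealthReply health_reply(ServerState state) {
    switch (state) {
        case ServerState::LoadingModel: return {503, R"({"status": "loading model"})"};
        case ServerState::Ready:        return {200, R"({"status": "ok"})"};
        case ServerState::Error:        return {500, R"({"status": "error"})"};
    }
    return {500, R"({"status": "error"})"};  // unreachable fallback for safety
}
```

Because the listener is already up while the model loads, a client polling /health would see the 503 reply first and the 200 reply once loading completes.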

mann1x avatar Apr 18 '24 10:04 mann1x

In any case, it would be more logical to call ctx_server.load_model(params) only after all endpoints are registered. Additionally, we can add a middleware that returns 503 if the model is not yet loaded.

Binding them before doesn't work, the model must be loaded. They can be bound afterwards without issues. There's really no reason to use the other endpoints while the server reports that the model is still being loaded. But indeed, I hadn't thought about 404 not being the right answer. I made this for ollama, which doesn't use any other endpoint.

I will amend it by registering the other endpoints with a static 503 response before listening, then re-registering them once the model is loaded.

mann1x avatar Apr 18 '24 10:04 mann1x

Furthermore, this change requires the main thread to call svr to register new endpoints after the server has been spawned into a new thread. This makes the use of svr not thread-safe.

You are right, I didn't check this. I will try to make it work without re-registering the endpoints at all.
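One way to avoid re-registering endpoints is to register every handler once, before the listener thread starts, and let each handler consult an atomic flag that the loading thread flips when it finishes. The sketch below uses only the standard library; `ServerContext`, `handler_status`, `load_model_stub`, and `status_after_load` are hypothetical stand-ins, not code from the PR.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical shared state; the only field handlers read concurrently.
struct ServerContext {
    std::atomic<bool> model_loaded{false};
};

// Stand-in for a route handler that is registered once, up front.
int handler_status(const ServerContext& ctx) {
    return ctx.model_loaded.load(std::memory_order_acquire) ? 200 : 503;
}

// Stand-in for the slow model load; publishes completion via the flag.
void load_model_stub(ServerContext& ctx) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    ctx.model_loaded.store(true, std::memory_order_release);
}

// Run the stub loader on its own thread, wait for it, then query the handler.
int status_after_load(ServerContext& ctx) {
    std::thread loader(load_model_stub, std::ref(ctx));
    loader.join();
    return handler_status(ctx);
}
```

With this shape the routing table is never mutated after the listener starts; only the atomic flag changes, so no extra locking around svr is needed for this purpose.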

mann1x avatar Apr 18 '24 10:04 mann1x

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 468 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10191.32ms p(95)=27347.77ms fails=, finish reason: stop=422 truncated=46
  • Prompt processing (pp): avg=113.71tk/s p(95)=501.27tk/s
  • Token generation (tg): avg=24.53tk/s p(95)=38.08tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=mannix-server-startup commit=942f023930ee7b5877034de52fabd3e67aed3589

Charts (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 468 iterations, x-axis 1713596946 --> 1713597588):
  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

github-actions[bot] avatar Apr 18 '24 10:04 github-actions[bot]

Binding them before doesn't work, the model must be loaded.

The reference to ctx_server never changes before or after the model is loaded. If you see errors, that may be because something accesses ctx_server before the model is loaded (for example, checking the chat template). You should move all such accesses below the code block where all endpoints are registered and the HTTP server is listening.

Another thing to add: inside svr->set_pre_routing_handler, there should be a middleware that checks whether we're accessing an endpoint other than /health. If the model is not loaded, the middleware must return a 503 error.
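The decision such a middleware has to make can be sketched independently of the HTTP library. In the standard-library-only example below, `pre_routing_gate` is a hypothetical helper: it returns the status code to short-circuit with, or 0 to let the request continue to its normal handler. How this is wired into the pre-routing handler is left out, since that depends on the library's callback signature.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns an HTTP status to reply with immediately,
// or 0 to mean "let the request through to its normal handler".
int pre_routing_gate(const std::string& path, bool model_loaded) {
    if (path == "/health") {
        return 0;    // /health is always reachable so it can report progress
    }
    if (!model_loaded) {
        return 503;  // Service Unavailable while the model is still loading
    }
    return 0;
}
```

For instance, a request to /completion during loading would be answered with 503, while /health would still reach its own handler.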

ngxson avatar Apr 18 '24 10:04 ngxson

@ngxson

Can you please check it again and let me know?

The empty_json_model() is just a leftover of course.

What I'm wondering is if the middleware should report error 500 also for the static pages.

Thanks!

mann1x avatar Apr 18 '24 16:04 mann1x

What I'm wondering is if the middleware should report error 500 also for the static pages

If we can return the static pages while the model is loading, this is fine.
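If static pages are to stay reachable during loading, the gate could take a set of paths that are always served. This is only a sketch: `gate_with_allowlist` and the example paths in the usage note are assumptions for illustration, not the list the server actually uses.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical helper: 0 lets the request through, 503 rejects it.
// `always_served` holds paths that stay reachable while the model loads.
int gate_with_allowlist(const std::string& path, bool model_loaded,
                        const std::unordered_set<std::string>& always_served) {
    if (model_loaded || always_served.count(path) > 0) {
        return 0;
    }
    return 503;
}
```

Called with a set such as {"/health", "/", "/index.html"}, the gate lets those paths through during loading and rejects everything else with 503 until the model is ready.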

phymbert avatar Apr 18 '24 18:04 phymbert

@ngxson Seems to work, let me know. Thanks a lot!

mann1x avatar Apr 19 '24 17:04 mann1x