
Server: Unix Socket Support

Open adrianliechti opened this issue 3 months ago • 6 comments

The idea of this pull request is to ease integration of the llama.cpp server by using Unix sockets instead of TCP. cpp-httplib has built-in support for Unix sockets: https://github.com/yhirose/cpp-httplib/pull/1346

my idea was to not add an additional parameter, but use a --host prefix: unix:// (similar to docker's client/server pattern).

a very first attempt is here, mainly to understand if this is something you could imagine in the code.

(the file should not exist before)
./server --host unix:///tmp/llama.sock --model ~/Projects/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
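
As a rough illustration of the prefix idea (not the code from this PR), the `--host` value could be split into a transport and a target as below. The sketch is in Python for brevity; `parse_host` and `ListenAddress` are hypothetical names, and the actual server would do the equivalent in C++ before handing the path to cpp-httplib.

```python
from typing import NamedTuple

UNIX_PREFIX = "unix://"


class ListenAddress(NamedTuple):
    kind: str    # "unix" or "tcp"
    target: str  # socket path for "unix", hostname or IP for "tcp"


def parse_host(host: str) -> ListenAddress:
    """Interpret a --host value, treating a unix:// prefix as a socket path."""
    if host.startswith(UNIX_PREFIX):
        path = host[len(UNIX_PREFIX):]
        if not path:
            raise ValueError("unix:// host needs a socket path")
        return ListenAddress("unix", path)
    # Anything without the prefix keeps its existing TCP meaning.
    return ListenAddress("tcp", host)


print(parse_host("unix:///tmp/llama.sock"))
print(parse_host("127.0.0.1"))
```

Reusing `--host` keeps the command line unchanged for existing TCP users, since any value without the prefix is passed through untouched.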

connect using socat

socat TCP-LISTEN:1234,fork UNIX-CONNECT:/tmp/llama.sock
curl http://localhost:1234/v1/models

connect using curl:

curl --unix-socket /tmp/llama.sock http://localhost/v1/models
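
For clients that cannot shell out to curl, the same request can be made with Python's standard library. This is a minimal sketch, assuming the server was started with the socket path shown above; `UnixHTTPConnection` and `get` are hypothetical helper names, not part of llama.cpp.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 10.0):
        # The host is only used for the Host header; no TCP lookup happens.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.settimeout(self.timeout)
            sock.connect(self.socket_path)
        except OSError:
            sock.close()
            raise
        self.sock = sock


def get(socket_path: str, url: str):
    """Send a GET request over the socket and return (status, body bytes)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", url)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    # Assumes a server is listening on /tmp/llama.sock.
    status, body = get("/tmp/llama.sock", "/v1/models")
    print(status, body.decode())
```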

open points:

  • make path absolute?
  • some error handling?
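
One possible way to address both open points, sketched in Python rather than in the server's C++: resolve the path to an absolute one and fail early if the directory is missing or the file already exists. `resolve_socket_path` is a hypothetical helper, and refusing to reuse an existing path is an assumption based on the note that the file should not exist before startup.

```python
import os


def resolve_socket_path(path: str) -> str:
    """Return an absolute socket path, failing early on obvious problems."""
    abs_path = os.path.abspath(path)
    parent = os.path.dirname(abs_path)
    if not os.path.isdir(parent):
        raise FileNotFoundError(f"socket directory does not exist: {parent}")
    # Assumption: an existing file (stale socket or otherwise) is an error.
    if os.path.lexists(abs_path):
        raise FileExistsError(f"socket path already exists: {abs_path}")
    return abs_path


if __name__ == "__main__":
    print(resolve_socket_path("llama.sock"))
```

Whether a stale socket file should instead be removed automatically is a design choice left open here; failing loudly is the more conservative option.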

adrianliechti, Mar 31 '24 22:03

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 534 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8746.54ms p(90)=25799.57ms fails=0, finish reason: stop=534 truncated=0
  • Prompt processing (pp): avg=235.76tk/s p(90)=696.9tk/s total=206.45tk/s
  • Token generation (tg): avg=100.17tk/s p(90)=269.19tk/s total=131.19tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=0b70ac0f6606fd1583afeed5a0bacec035d34444
Time series charts (plots omitted), each covering 1711960154 --> 1711960782:

  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

github-actions[bot], Mar 31 '24 22:03

This should be tested on all platforms.

FSSRepo, Apr 01 '24 00:04

@FSSRepo you are right. i tested on linux and macos, and added an ifndef for windows, similar to httplib's implementation. do you have more platforms in mind?

@phymbert might you have some guidance here? shall i add a sample shell script or extend the python test suite? i mainly ask because i don't want to slow down every test cycle for such a niche feature...

adrianliechti, Apr 01 '24 10:04

> might you have some guidance here? shall i add a sample shell script or extend the python test suite?

I suggest adding a simple dedicated scenario in a new feature using unix://. I hope no additional changes are required since we already checked the sock family in the python glue. Regarding the overhead of the new scenario: we are using a very small model, so it adds only a few seconds. It's OK.

phymbert, Apr 01 '24 10:04

Regarding server tests, @phymbert has provided quite good documentation over here: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests

One way to improve this even further, and to help new contributors implement tests, is to reference a very small PR that introduces a basic server test without any extra changes. I'm not sure if we have one yet. If not, we can create one and point people to that PR as a starting point for implementing new tests.

ggerganov, Apr 01 '24 11:04

> One way to improve this even further and help new contributors to implement tests, is to reference a very small PR that introduces a basic server test, without any extra changes. I'm not sure if we have one yet - if not, we can create, and we can point people to that PR as a starting point for implementing new tests.

Yes, a good example is:

  • https://github.com/ggerganov/llama.cpp/pull/6341/files#diff-078c109cd25774fd54e338cc718c4ed8ffafa3e1af520e584dd2de0319ad7a66R1
  • in #6341

phymbert, Apr 01 '24 13:04

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 542 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8613.69ms p(95)=20673.14ms fails=, finish reason: stop=468 truncated=74
  • Prompt processing (pp): avg=107.36tk/s p(95)=485.37tk/s
  • Token generation (tg): avg=34.28tk/s p(95)=46.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=b2d3dd3cc9945bd02ef15bdb48ae50c51759c64e

Time series charts (plots omitted), each covering 1715205540 --> 1715206170:

  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

github-actions[bot], May 08 '24 22:05