Server: Unix Socket Support

Open adrianliechti opened this issue 1 year ago • 6 comments

The idea of this pull request is to ease integration of the llama.cpp server by using unix sockets instead of tcp. cpp-httplib has support for unix sockets built in: https://github.com/yhirose/cpp-httplib/pull/1346

my idea was to not add an additional parameter, but use a --host prefix: unix:// (similar to docker's client/server pattern).

a very first attempt is here, mainly to understand if this is something you could imagine in the code.

(the socket file must not exist beforehand)
./server --host unix:///tmp/llama.sock --model ~/Projects/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
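
To make the prefix idea concrete, here is a minimal sketch of the dispatch logic. It is written in Python for brevity and is not taken from the pull request; the function name parse_host and the choice of AF_UNSPEC for plain hosts are assumptions for illustration only.

```python
import socket

UNIX_PREFIX = "unix://"


def parse_host(host):
    """Split a --host value into (address_family, address).

    Hypothetical helper: a value starting with unix:// is treated as a
    socket path, anything else is passed through as a normal hostname.
    """
    if host.startswith(UNIX_PREFIX):
        # Platforms without unix sockets do not expose AF_UNIX at all.
        if not hasattr(socket, "AF_UNIX"):
            raise ValueError("unix sockets are not supported on this platform")
        path = host[len(UNIX_PREFIX):]
        if not path:
            raise ValueError("missing socket path after unix://")
        return socket.AF_UNIX, path
    # No prefix: leave the family open so name resolution can decide.
    return socket.AF_UNSPEC, host
```

With this shape no extra command-line parameter is needed: the existing --host value alone decides whether the server binds to a path or to a TCP address.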

connect using socat

socat TCP-LISTEN:1234,fork UNIX-CONNECT:/tmp/llama.sock
curl http://localhost:1234/v1/models

connect using curl:

curl --unix-socket /tmp/llama.sock http://localhost/v1/models
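
For a client without curl or socat, the same request can be sketched with Python's standard library. UnixHTTPConnection and get_over_unix_socket are hypothetical helper names, not part of llama.cpp or its test suite; the socket path and endpoint in the usage note are the ones from the commands above.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # "localhost" only fills the Host header; it is never resolved.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        # Replace the default TCP connect with an AF_UNIX stream socket.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def get_over_unix_socket(socket_path, url_path):
    """Send a GET request over the socket and return (status, body_bytes)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", url_path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Assuming the server from the example above is running, get_over_unix_socket("/tmp/llama.sock", "/v1/models") would issue the same request as the curl command.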

open points:

  • make path absolute?
  • some error handling?
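
One possible way to address both open points, sketched in Python rather than the server's C++ and not taken from the pull request. The helper name prepare_socket_path is hypothetical, and the 104-byte limit is an assumption based on the usual sun_path size (108 bytes on Linux, 104 on macOS).

```python
import os

# Assumed conservative limit: sun_path is typically 108 bytes on Linux
# and 104 bytes on macOS, including the terminating NUL.
MAX_SOCKET_PATH = 104


def prepare_socket_path(path):
    """Return an absolute socket path, or raise if it cannot be bound."""
    abs_path = os.path.abspath(path)
    if len(os.fsencode(abs_path)) >= MAX_SOCKET_PATH:
        raise ValueError("socket path too long: " + abs_path)
    if not os.path.isdir(os.path.dirname(abs_path)):
        raise FileNotFoundError("directory does not exist for " + abs_path)
    # Binding fails if anything already exists at the path, so report it early.
    if os.path.lexists(abs_path):
        raise FileExistsError("socket path already exists: " + abs_path)
    return abs_path
```

Failing early with a specific message is one design choice; another would be to remove a stale socket file automatically, which is more convenient but risks deleting a socket that another running server still uses.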

Mar 31 '24 22:03 adrianliechti

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 534 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8746.54ms p(90)=25799.57ms fails=0, finish reason: stop=534 truncated=0
  • Prompt processing (pp): avg=235.76tk/s p(90)=696.9tk/s total=206.45tk/s
  • Token generation (tg): avg=100.17tk/s p(90)=269.19tk/s total=131.19tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=0b70ac0f6606fd1583afeed5a0bacec035d34444
Time series (charts omitted; each plots the metric over timestamps 1711960154 --> 1711960782, titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 534 iterations"):

  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

Mar 31 '24 22:03 github-actions[bot]

This should be tested on all platforms

Apr 01 '24 00:04 FSSRepo

@FSSRepo you are right. i tested on linux and macos, and added an ifndef for windows - similar to httplib's implementation. do you have more platforms in mind?

@phymbert might you have some guidance here? shall i add a sample shell script or extend the python test suite? i mainly ask because i don't want to slow down every test cycle for such a niche feature...

Apr 01 '24 10:04 adrianliechti

> might you have some guidance here? shall i add a sample shell script or extend the python test suite?

I suggest adding a simple dedicated scenario in a new feature using unix://. I hope no additional changes are required, since we already check the socket family in the python glue. Regarding the overhead of the new scenario: we are using a very small model, so a new scenario adds only seconds. It's OK.

Apr 01 '24 10:04 phymbert

Regarding server tests, @phymbert has provided quite good documentation over here: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests

One way to improve this even further, and to help new contributors implement tests, is to reference a very small PR that introduces a basic server test without any extra changes. I'm not sure if we have one yet - if not, we can create one and point people to it as a starting point for implementing new tests.

Apr 01 '24 11:04 ggerganov

> One way to improve this even further, and to help new contributors implement tests, is to reference a very small PR that introduces a basic server test without any extra changes. I'm not sure if we have one yet - if not, we can create one and point people to it as a starting point for implementing new tests.

Yes, a good example is:

  • https://github.com/ggerganov/llama.cpp/pull/6341/files#diff-078c109cd25774fd54e338cc718c4ed8ffafa3e1af520e584dd2de0319ad7a66R1
  • in #6341

Apr 01 '24 13:04 phymbert

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 529 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8867.72ms p(95)=21960.19ms fails=, finish reason: stop=471 truncated=58
  • Prompt processing (pp): avg=100.77tk/s p(95)=417.74tk/s
  • Token generation (tg): avg=45.8tk/s p(95)=47.8tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=d7a7a780c95de47c96dcc16585099412d89e24be

Time series (charts omitted; each plots the metric over timestamps 1717783052 --> 1717783678, titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 529 iterations"):

  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

May 08 '24 22:05 github-actions[bot]