llama.cpp

Implement '--keep-split' to quantize model into several shards

Open zj040045 opened this issue 1 year ago • 3 comments

Fixes https://github.com/ggerganov/llama.cpp/issues/6548. --keep-split lets quantize write its output as shards instead of a single full model. The number of shards depends on the input model files.
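As a rough illustration of the behaviour described above, here is a minimal Python sketch of the output file names one might expect with --keep-split. It rests on two assumptions that are not stated in this thread: that output shards follow the gguf-split naming pattern `<prefix>-00001-of-0000N.gguf`, and that the output shard count equals the input shard count. `shard_names` is a hypothetical helper, not part of llama.cpp.

```python
def shard_names(prefix: str, n_shards: int) -> list[str]:
    """Return the assumed shard file names for a model split into n_shards files.

    Assumption: names follow '<prefix>-<index>-of-<count>.gguf' with
    1-based, zero-padded five-digit numbers.
    """
    if n_shards < 1:
        raise ValueError("n_shards must be >= 1")
    return [
        f"{prefix}-{i:05d}-of-{n_shards:05d}.gguf"
        for i in range(1, n_shards + 1)
    ]


# Example: an input model split into 3 files would yield 3 quantized shards
# (assuming a one-to-one mapping between input and output shards).
names = shard_names("ggml-model-q4_0", 3)
for name in names:
    print(name)
```

Without --keep-split, the same input would instead be written as one merged output file.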

zj040045 avatar Apr 15 '24 14:04 zj040045

Thanks. Would you mind adding a tests.sh, as we did in #6655?

phymbert avatar Apr 17 '24 17:04 phymbert

@phymbert Done

zj040045 avatar Apr 18 '24 14:04 zj040045

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 440 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=10696.13ms p(95)=27432.23ms fails=, finish reason: stop=391 truncated=49
  • Prompt processing (pp): avg=123.53tk/s p(95)=583.07tk/s
  • Token generation (tg): avg=26.16tk/s p(95)=35.91tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=jiez/quantize-keep-split commit=79bbf42495b9d01ec93ae87d8ef7e24d8f721e3f
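The figures in the summary above can be cross-checked with a few lines of Python. This is only a sanity check on the reported numbers (stop=391, truncated=49, 440 iterations); the variable names are illustrative and not taken from the bench tooling.

```python
# Values copied from the benchmark summary.
stop, truncated, iterations = 391, 49, 440

# Every iteration should end with exactly one finish reason.
total = stop + truncated
truncated_share = truncated / total

print(f"requests with a finish reason: {total} of {iterations} iterations")
print(f"truncated share: {truncated_share:.1%}")
```

The two finish reasons add up to the reported iteration count, and roughly one request in nine was truncated.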

prompt_tokens_seconds

[Chart: line plot of llamacpp:prompt_tokens_seconds over time, x-axis 1713451265 --> 1713451893. Title: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 440 iterations"]
                    
predicted_tokens_seconds

[Chart: line plot of llamacpp:predicted_tokens_seconds over time, x-axis 1713451265 --> 1713451893. Title: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 440 iterations"]
                    

Details

kv_cache_usage_ratio

[Chart: line plot of llamacpp:kv_cache_usage_ratio over time, x-axis 1713451265 --> 1713451893. Title: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 440 iterations"]
                    
requests_processing

[Chart: line plot of llamacpp:requests_processing over time, x-axis 1713451265 --> 1713451893. Title: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 440 iterations"]
                    

github-actions[bot] avatar Apr 18 '24 14:04 github-actions[bot]