
Fix self extend on the server.

[Open] Maximilian-Winter opened this pull request 1 year ago • 3 comments

Self extend is broken on the server, according to https://github.com/ggerganov/llama.cpp/issues/7005. This PR tries to fix the self extend mechanism in the server. I tested it with the passkey test and it predicted the passkey correctly. I replicated llama.cpp's passkey test myself because I wasn't sure how to interpret the results of the behave run: I copied the prompt shown by the behave passkey test, added the token "[INST]" at the beginning and "[/INST]" at the end, and ran it against the completion endpoint.
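The manual check described above can be sketched roughly like this. This is a minimal illustration, not the PR's actual test code: the filler text, passkey value, insertion point, and payload fields are assumptions, patterned after llama.cpp's passkey example.

```python
import json

def build_passkey_prompt(passkey: str, n_filler: int = 40, insert_at: int = 20) -> str:
    """Bury a passkey inside repeated filler text and wrap the whole thing
    in Mistral-style [INST] ... [/INST] tags, as described above."""
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    parts = [filler] * n_filler
    parts.insert(insert_at, needle)  # place the passkey mid-context
    question = "What is the pass key?"
    return f"[INST] {''.join(parts)}\n{question} [/INST]"

prompt = build_passkey_prompt("42861")

# Payload for the llama.cpp server completion endpoint, e.g.:
#   curl http://localhost:8080/completion -d @payload.json
payload = json.dumps({"prompt": prompt, "n_predict": 32})
```

A correct run with self extend enabled should echo the passkey back in the completion; without it, models tend to fail once the context grows past their training window.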

I'd be happy if someone could give it a try and test it.

Edit: I did another test with Mistral Instruct v0.2 on a 50,000-token context with the passkey placed once in the middle. It worked really well. In another test without self extend enabled, the model said the passkey isn't in the text.
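For reference, self extend on the server is controlled by the group-attention flags. A launch along these lines reproduces the kind of setup described above; the model path, context size, and flag values here are illustrative, not the exact ones used in the test:

```shell
# Start the llama.cpp server with self extend (group attention) enabled.
# --grp-attn-n: group-attention factor; --grp-attn-w: group-attention width.
./server \
  -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -c 8192 \
  --grp-attn-n 8 \
  --grp-attn-w 2048
```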

Maximilian-Winter avatar May 12 '24 10:05 Maximilian-Winter

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Expand details for performance-related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8491.26ms p(95)=20477.4ms fails=, finish reason: stop=483 truncated=68
  • Prompt processing (pp): avg=101.21 tk/s p(95)=430.92 tk/s
  • Token generation (tg): avg=34.55 tk/s p(95)=48.95 tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fixed_self_extension commit=f4f5b7ac560de66be4e875210f8c3679ef4b3dac

prompt_tokens_seconds

[Chart: llamacpp:prompt_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]
predicted_tokens_seconds

[Chart: llamacpp:predicted_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]


kv_cache_usage_ratio

[Chart: llamacpp:kv_cache_usage_ratio — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]
requests_processing

[Chart: llamacpp:requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations]

github-actions[bot] avatar May 12 '24 10:05 github-actions[bot]

I did additional tests and realized that the code I previously removed wasn't being called anyway. But in my tests it works as it should and finds the passkey. For example, Hermes Pro Llama 8B with an 8k context can retrieve the passkey from a 50k-token text with self extend, but produces garbage without it.

Maximilian-Winter avatar May 12 '24 14:05 Maximilian-Winter

It's probably better to go back and see which change makes the server test fail

ggerganov avatar May 12 '24 15:05 ggerganov