Fix self extend on the server.
According to https://github.com/ggerganov/llama.cpp/issues/7005, self extend is broken on the server. This PR tries to fix the self-extend mechanism in the server. I tested it with the passkey test and it predicted the passkey correctly. I replicated llama.cpp's passkey test by hand because I wasn't sure how to interpret the results of the behave run: I copied the prompt shown in the behave passkey test, added "[INST]" at the beginning and "[/INST]" at the end, and ran it against the completion endpoint.
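For reference, the replicated test boils down to a single request like the sketch below (assuming a local server on port 8080 started with self extend enabled, e.g. something like `--grp-attn-n 4 --grp-attn-w 2048`; the placeholder prompt stands in for the long behave passkey prompt):

```python
import requests

# Placeholder for the long filler text from the behave passkey scenario,
# with the passkey hidden somewhere inside it.
passkey_prompt = "..."

response = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        # Wrap the prompt in the instruct template as described above.
        "prompt": "[INST] " + passkey_prompt + " [/INST]",
        "n_predict": 64,
        "temperature": 0.0,  # greedy, so the answer is reproducible
    },
    timeout=600,
)
print(response.json()["content"])
```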
I would be happy if someone else could give it a try and test it as well.
Edit: I did another test with Mistral Instruct v0.2 on a text of about 50,000 tokens with the passkey placed once in the middle, and it worked really well. I then ran the same test without self extend enabled, and the model said the passkey isn't in the text.
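For context, the idea behind self extend is that tokens inside a recent neighbor window keep their exact positions while more distant tokens share grouped position ids, so the effective relative positions stay within the trained context. Below is a rough back-of-the-envelope sketch of that effect (an illustration of the general idea only, not the server's actual KV-cache bookkeeping; the parameter names just mirror the group-attention options):

```python
def max_effective_position(n_tokens: int, grp_attn_n: int, grp_attn_w: int) -> int:
    """Rough upper bound on the relative position the model sees with self extend.

    Tokens inside the neighbor window (grp_attn_w) keep exact positions; everything
    further back is compressed by the group factor grp_attn_n. Purely illustrative.
    """
    if n_tokens <= grp_attn_w:
        return n_tokens
    return grp_attn_w + (n_tokens - grp_attn_w) // grp_attn_n

# Example: 50,000 tokens with grp_attn_n=8 and grp_attn_w=2048 map to roughly
# 2048 + 47_952 // 8 = 8_042 effective positions, i.e. close to an 8k training context.
print(max_effective_position(50_000, 8, 2048))  # -> 8042
```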
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request: avg=8491.26ms p(95)=20477.4ms fails=, finish reason: stop=483 truncated=68
- Prompt processing (pp): avg=101.21tk/s p(95)=430.92tk/s
- Token generation (tg): avg=34.55tk/s p(95)=48.95tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fixed_self_extension commit=f4f5b7ac560de66be4e875210f8c3679ef4b3dac
[Benchmark charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations; metrics: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing]
I did additional tests and realized that the code that was previously removed wasn't being called anyway. Still, in my tests it works as it should and finds the passkey. For example, Hermes Pro Llama 8B with an 8k context can retrieve the passkey from a 50k-token text with self extend, but produces garbage without it.
It's probably better to go back and see which change makes the server test fail.