llama : add option to render special/control tokens
fix #6770
Setting `special == true` in `llama_token_to_piece()` causes the text of special/control tokens to be rendered in the output:
https://github.com/ggerganov/llama.cpp/blob/1f45c2adc7b10637c2035e622573f1851e403979/llama.h#L827-L837
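For illustration, a minimal sketch of the new flag in use. The `token_to_piece` helper name is made up here, and the negative-return/resize convention is assumed to mirror the existing common helpers:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Hypothetical helper: render one token, optionally including the text of
// special/control tokens (the new `special` parameter in the linked header).
static std::string token_to_piece(const llama_model * model, llama_token tok, bool special) {
    std::vector<char> buf(16);
    int32_t n = llama_token_to_piece(model, tok, buf.data(), (int32_t) buf.size(), special);
    if (n < 0) {            // assumed convention: negative return = required buffer size
        buf.resize(-n);
        n = llama_token_to_piece(model, tok, buf.data(), (int32_t) buf.size(), special);
    }
    return std::string(buf.data(), n);
}

// With special == true an EOS token renders as its text (e.g. "</s>",
// model-dependent); with special == false it renders as an empty piece:
//   token_to_piece(model, llama_token_eos(model), /*special =*/ true);
//   token_to_piece(model, llama_token_eos(model), /*special =*/ false);
```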
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 215 iterations 🚀
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=22612.46ms p(95)=38873.62ms fails=, finish reason: stop=101 truncated=114
- Prompt processing (pp): avg=269.66tk/s p(95)=800.94tk/s
- Token generation (tg): avg=23.51tk/s p(95)=26.01tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/render-control-tokens commit=ed5d273c4dcc075a86b94a831bb825fb98519ce0
[charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 215 iterations — llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing]
Performance dropped - maybe generation does not stop properly after the #6745 EOG changes?
Very likely, because we're using the phi-2 model, which does not have native support for chatml (so `<|im_end|>` is not a single token - it is broken into multiple tokens)
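One way to see this, a sketch assuming the `llama_tokenize()` signature from this era of llama.h (the `check_stop_string` helper is made up):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>
#include "llama.h"

// Hypothetical helper: report how many tokens a stop string occupies in the
// loaded vocab. For a chatml-tuned model "<|im_end|>" should be 1 token; for
// base phi-2 it splits into several, so EOG detection can't key on it.
static void check_stop_string(const llama_model * model, const char * text) {
    std::vector<llama_token> toks(32);
    const int32_t n = llama_tokenize(model, text, (int32_t) strlen(text),
                                     toks.data(), (int32_t) toks.size(),
                                     /*add_special   =*/ false,
                                     /*parse_special =*/ true);
    printf("'%s' -> %d token(s)\n", text, n);
}
```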
~~Edit: The simple fix is to bring back the line `llama_params["stop"].push_back("<|im_end|>");` in server/utils.hpp. Only the chatml `<|im_end|>` needs this special treatment; other templates like gemma or llama3 don't.~~
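For context, a simplified reconstruction of where that retracted fallback would sit in the server's OpenAI-compat parameter translation (the `parse_stop_words` wrapper is invented for this sketch; the surrounding code in examples/server/utils.hpp is approximate):

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Simplified reconstruction of the stop-word handling in
// oaicompat_completion_params_parse() (examples/server/utils.hpp):
static json parse_stop_words(const json & body) {
    json llama_params;

    // map the OpenAI-style "stop" field (string or array) onto llama.cpp stop words
    if (body.contains("stop") && body["stop"].is_string()) {
        llama_params["stop"] = json::array({ body["stop"].get<std::string>() });
    } else {
        llama_params["stop"] = body.value("stop", json::array());
    }

    // the retracted fallback: unconditionally append the chatml end marker so a
    // model whose vocab lacks a single <|im_end|> token still stops on the string
    llama_params["stop"].push_back("<|im_end|>");

    return llama_params;
}
```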
I think we are incorrectly using a base model instead of an instruction-tuned one for this test:
https://huggingface.co/microsoft/phi-2
The phi-2 model does not support any chat template because it is a base model. We have to replace the model used in the benchmark with an instruction-tuned one.
Ah yeah that's right. We can use dolphin-phi2 then. Here is the link: https://huggingface.co/TheBloke/dolphin-2_6-phi-2-GGUF
The `<|im_start|>`, `<|im_end|>` and chat template of the HF model are all correct: https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2/blob/main/tokenizer_config.json#L325