
Added llama-3 chat template

Open DifferentialityDevelopment opened this pull request 1 year ago • 27 comments

This is just simply to add the llama 3 chat template

This is my first ever pull request, so please feel free to give me feedback on anything I could improve upon.

@DifferentialityDevelopment thanks for your quick work on getting a PR open, I pulled your changes to llama.cpp and rebuilt, then tried the new template. I'm seeing some issues with it still, but maybe I'm doing something wrong....? I'm running llama.cpp in server mode, with --chat-template llama3. Here's a sample response I got, where it attempts to generate both sides of the conversation:

Hello! It's so nice to chat with you again! I'm your friendly virtual assistant. How's your day going so far? Do you need any help or assistance with anything? I'm all ears (or rather, all text) and happy to lend a hand!assistant

Hi! Thanks for checking in! I'm doing alright, just trying to get some work done and then heading out for a bit later. Nothing too exciting. But it's always great to chat with you. How about you? How's your day going?assistant

Thank you! I'm doing well,.......

thecivilizedgamer avatar Apr 18 '24 22:04 thecivilizedgamer

@DifferentialityDevelopment thanks for your quick work on getting a PR open, I pulled your changes to llama.cpp and rebuilt, then tried the new template. I'm seeing some issues with it still [...]

A lot of the GGUF quants had the eot token not being decoded correctly, so the model output wouldn't stop appropriately. I was seeing the exact same thing in LM Studio; it's not a problem with llama.cpp itself. For reference, this is the GGUF I'm currently using that doesn't have the issue: https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF


Hi, I think you should also modify file utils.hpp

    llama_params["stop"].push_back("<|im_end|>"); // chatml
    llama_params["stop"].push_back("<end_of_turn>"); // gemma
    llama_params["stop"].push_back("<|eot_id|>"); // llama 3 <-- need this

to make it stop at "<|eot_id|>"

x4080 avatar Apr 19 '24 01:04 x4080

A lot of the GGUF quants had the eot token not being decoded correctly [...] For reference, this is the GGUF I'm currently using that doesn't have the issue: https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

Ah, that makes sense. I switched models and now it's working perfectly. Thanks to you and everyone else for your efforts :)

thecivilizedgamer avatar Apr 19 '24 03:04 thecivilizedgamer

Hi, I think you should also modify file utils.hpp [...] to make it stop at "<|eot_id|>"

I'm adding this now, thanks for this!

I'm not 100% sure, but I think I know why the one test might be failing: tests/test_chat_template.cpp line 79 begins with "<|begin_of_text|>", i.e. the BOS token. I noticed that none of the other models include their BOS token in their expected output, and this lines up with the change @ngxson asked me to make to remove the adding of the BOS token in the chat template, since it's handled automatically. So I'm guessing that removing it would make that test pass.
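For illustration only (these are hypothetical strings, not the actual contents of tests/test_chat_template.cpp), the change amounts to dropping the BOS prefix from the expected output:

    // before: the reference string began with the BOS token
    const std::string expected_old = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|>...";
    // after: the tokenizer adds BOS itself, so the reference starts at the first header
    const std::string expected_new = "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|>...";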

Yes, you need to remove the BOS text from the reference string

ggerganov avatar Apr 19 '24 09:04 ggerganov

Yes, you need to remove the BOS text from the reference string

Done!

Is this missing? {% if loop.index0 == 0 %}{% set content = bos_token + content %}

foldl avatar Apr 19 '24 10:04 foldl

Is this missing? {% if loop.index0 == 0 %}{% set content = bos_token + content %}

As explained in https://github.com/ggerganov/llama.cpp/pull/6751#discussion_r1571972422, the BOS token is added by the tokenizer, so it should not appear in the template

ngxson avatar Apr 19 '24 10:04 ngxson

@ngxson Oh, thanks. I got it wrong.

The template itself is badly designed. It can be simplified:

{bos_token}{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
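For readers who want the same logic outside of Jinja, here is a minimal standalone C++ sketch of the format (not the PR's actual code; the chat_msg struct and format_llama3 function are made up for illustration, and BOS is deliberately left out because llama.cpp's tokenizer adds it automatically):

    #include <string>
    #include <vector>

    struct chat_msg { std::string role, content; };

    // Render a conversation in the Llama 3 instruct format:
    // "<|start_header_id|>ROLE<|end_header_id|>\n\nCONTENT<|eot_id|>" per message,
    // optionally followed by an open assistant header to prompt the next reply.
    static std::string format_llama3(const std::vector<chat_msg> & msgs, bool add_assistant_prompt) {
        std::string out;
        for (const auto & m : msgs) {
            out += "<|start_header_id|>" + m.role + "<|end_header_id|>\n\n" + m.content + "<|eot_id|>";
        }
        if (add_assistant_prompt) {
            out += "<|start_header_id|>assistant<|end_header_id|>\n\n";
        }
        return out;
    }

(The Jinja template above also runs each message's content through | trim, which is omitted here for brevity.)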

foldl avatar Apr 19 '24 10:04 foldl

FYI, I added llama3 to the list of supported templates in the wiki page. This PR looks good to me and should get merged now. The failed CI job (build server) doesn't seem to be related to the changes from this PR. To be extra safe, I'll ask @ggerganov to merge it. Thank you all for your efforts.

ngxson avatar Apr 19 '24 15:04 ngxson

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 216 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=22543.45ms p(95)=43732.68ms fails=, finish reason: stop=100 truncated=116
  • Prompt processing (pp): avg=271.68tk/s p(95)=823.61tk/s
  • Token generation (tg): avg=24.05tk/s p(95)=26.13tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=16f8bba496ef62ee472c9406e13ba0e984e60223

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing time series for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 216 iterations.]
github-actions[bot] avatar Apr 19 '24 17:04 github-actions[bot]

Hi @DifferentialityDevelopment - per the model card on Hugging Face, <|eot_id|> is a stop token, but it seems like it might not be the only one? If you look at the PyTorch inference code in the model card, you'll see this:

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

<|eot_id|> is in this PR, but if you check the tokenizer_config.json and tokenizer.json files in the repo, pipeline.tokenizer.eos_token_id in the above snippet refers to <|end_of_text|> and is therefore also a stop token.

While <|eot_id|> is in this PR, <|end_of_text|> doesn't seem to be. Am I missing something?

K-Mistele avatar Apr 19 '24 19:04 K-Mistele

Hi @DifferentialityDevelopment - per the model card on Hugging Face, <|eot_id|> is a stop token, but it seems like it might not be the only one? [...] While <|eot_id|> is in this PR, <|end_of_text|> doesn't seem to be. Am I missing something?

Nicely spotted! That's actually strange; why would there be two different eot tokens? I figure that <|end_of_text|> is specifically used to separate different chats altogether, whereas <|eot_id|> is simply the end of a specific character's message, so I don't think it's strictly required to add it to the chat template?

I think I read somewhere (so take this with a grain of salt, because I don't remember where) that it's an artifact of how they did the instruct-tuning on top of the base model - the base model uses <|end_of_text|>, and the instruct model (sometimes?) uses <|eot_id|>, and sometimes uses <|end_of_text|>.

A cursory glance at the base model's card seems to support this - there's nothing about the custom terminators array or using <|eot_id|> as a terminator; it seems to use the default tokenizer config, which is <|end_of_text|>.

Maybe it doesn't need to be added to the chat template, and just needs to be configured as a stop token? I'm not sure how that would be done.

But the llama3 GitHub repository does use both as stop tokens in tokenizer.py:

        # from the tokenizer init method
        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
        self.eos_id: int = self.special_tokens["<|end_of_text|>"]
        self.pad_id: int = -1
        self.stop_tokens = {
            self.special_tokens["<|end_of_text|>"],
            self.special_tokens["<|eot_id|>"],
        }
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )

So, maybe it would be best to just mirror that behavior here and add the following?

llama_params["stop"].push_back("<|eot_id|>");

// add me too!
llama_params["stop"].push_back("<|end_of_text|>");

K-Mistele avatar Apr 19 '24 19:04 K-Mistele

So, maybe it would be best to just mirror that behavior here and add the following? [...]

    llama_params["stop"].push_back("<|end_of_text|>");

I've added end_of_text as another stop token for llama 3 in utils.hpp

<|end_of_text|> is the EOS token, so you don't need to include it in the list of stop words. In short, the server will stop generation when it receives the EOS token. For the same reason, you don't see </s> (for llama2) in that list: it's already the EOS of llama2

ngxson avatar Apr 19 '24 23:04 ngxson

<|end_of_text|> is the EOS token, so you don't need to include it in the list of stop words. In short, the server will stop generation when it receives the EOS token. For the same reason, you don't see </s> (for llama2) in that list: it's already the EOS of llama2

I've removed it as a stop word.

Yes, it should be removed. If we decide to add the EOS token as a stop sequence, we will also need to add it for the other templates (</s>, <|EOT|>, ...)

ngxson avatar Apr 19 '24 23:04 ngxson

<|eot_id|> is End of Turn.

Meta always includes the templates in their source code. We should always reference them as a guide.

The end of each message is marked by the <|eot_id|> token. source
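For reference, a short exchange rendered in this format looks roughly like the following (the leading <|begin_of_text|> is the BOS token, which llama.cpp's tokenizer adds automatically rather than the chat template):

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>

    You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

    Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The model then generates the assistant's reply and ends its turn with <|eot_id|>.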

teleprint-me avatar Apr 20 '24 06:04 teleprint-me

@ngxson @ggerganov Can we merge it?

phymbert avatar Apr 20 '24 14:04 phymbert

Yes, it looks good to me. I'm just wondering if we want to wait for the other PR that allows converting the model, then test the converted model with this template before actually merging it?

ngxson avatar Apr 20 '24 14:04 ngxson

Yes, I saw the other one after; better to wait

phymbert avatar Apr 20 '24 14:04 phymbert

Yes, it looks good to me. I'm just wondering if we want to wait for the other PR that allows converting the model, then test the converted model with this template before actually merging it?

Will converting the model help fix the "broken IQ and imatrix quants" https://github.com/ggerganov/llama.cpp/issues/6747#issuecomment-2067643457, or do all the problems originate from the base model itself?

reneleonhardt avatar Apr 20 '24 14:04 reneleonhardt

Yes, it looks good to me. I'm just wondering if we want to wait for the other PR that allows converting the model, then test the converted model with this template before actually merging it?

Yes, let's first merge #6745

ggerganov avatar Apr 20 '24 14:04 ggerganov

Hi, in the latest version "<|eot_id|>" appears at the end of the conversation; it seems utils.hpp doesn't have the stop token now

x4080 avatar Apr 25 '24 21:04 x4080

It's due to a different pull request that got merged, I think: https://github.com/ggerganov/llama.cpp/commit/b97bc3966e852adb626c90be64fd48282800f504#diff-ad8b15a29dd7c625dd2688de421972baaa73494a72d7210d679efc5f2ec0d888

llama_token_is_eog is supposed to return true for <|eot_id|> as far as I'm aware

    bool llama_token_is_eog(const struct llama_model * model, llama_token token) {
        return token != -1 && (
            token == llama_token_eos(model) ||
            token == llama_token_eot(model)
        );
    }
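As a rough sketch of what that check is meant to do in a generation loop (simplified, not the actual server code; ctx, model, n_max, and sample_next_token() are assumed or hypothetical):

    // Generation should end as soon as the sampled token is an end-of-generation
    // token, which for Llama 3 covers both <|end_of_text|> (EOS) and <|eot_id|> (EOT).
    std::vector<llama_token> generated;
    for (int n_gen = 0; n_gen < n_max; ++n_gen) {
        const llama_token tok = sample_next_token(ctx);  // hypothetical sampling helper
        if (llama_token_is_eog(model, tok)) {
            break;                                       // do not emit the stop token itself
        }
        generated.push_back(tok);                        // detokenize/stream as usual
    }

So if <|eot_id|> still shows up in the output, that check is evidently not firing for the token the model is producing.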

I'm seeing the same issue, with not only Llama 3 but also Phi 3 as described in this issue https://github.com/ggerganov/llama.cpp/issues/6903

Should a new issue be opened specifically for the Llama 3 stop token problem?

thecivilizedgamer avatar Apr 25 '24 22:04 thecivilizedgamer