
Refactor chat template API (WIP)

Open ngxson opened this issue 3 months ago • 2 comments

Based on the discussion from https://github.com/ggerganov/llama.cpp/issues/6391#issuecomment-2068353974

We introduce an enum llama_chat_template for templates and a family of functions:

    /// Get the Jinja template saved inside the given model
    /// @param model The pointer to llama_model
    /// @param name Template name (can be a nullptr for default template). See: https://github.com/ggerganov/llama.cpp/pull/6588
    /// @param buf The output buffer
    /// @param length The size of the allocated buffer
    /// @return The total number of bytes of the template. If a named template cannot be found, the default template is used. If no template can be found, it returns -1
    LLAMA_API int32_t llama_chat_get_model_template(
                const struct llama_model * model,
                              const char * name,
                                    char * buf,
                                 int32_t   length);

    /// Get the enum llama_chat_template based on Jinja template
    /// @param tmpl Jinja template (a string)
    /// @return The corresponding enum llama_chat_template
    LLAMA_API llama_chat_template llama_chat_get_template_type(const char * tmpl);

    /// Get the format prefix for a given message
    /// @param tmpl Use enum llama_chat_template
    /// @param role The role of the current message
    /// @param prev_role The role of the previous message, can be nullptr
    /// @param buf The output buffer
    /// @param length The size of the allocated buffer
    /// @return The total number of bytes of the output string
    LLAMA_API int32_t llama_chat_get_prefix(
                const llama_chat_template   tmpl,
                               const char * role,
                               const char * prev_role,
                                     char * buf,
                                  int32_t   length);

    /// Get the format postfix for a given message
    /// @param tmpl Use enum llama_chat_template
    /// @param role The role of the current message
    /// @param prev_role The role of the previous message, can be nullptr
    /// @param buf The output buffer
    /// @param length The size of the allocated buffer
    /// @return The total number of bytes of the output string
    LLAMA_API int32_t llama_chat_get_postfix(
                const llama_chat_template   tmpl,
                               const char * role,
                               const char * prev_role,
                                     char * buf,
                                  int32_t   length);

    /// Check whether a given template supports system messages
    LLAMA_API bool llama_chat_support_system_message(const llama_chat_template tmpl);
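
For reference, here is a minimal sketch of how a caller might use the proposed functions to build a prompt string. This is my illustration only: the message struct, buffer sizes and the lack of error handling are simplifications, not part of the proposal.

    #include <string>
    #include <vector>

    // hypothetical helper struct, not part of the proposal
    struct chat_msg { const char * role; std::string content; };

    static std::string build_prompt(const struct llama_model * model,
                                    const std::vector<chat_msg> & messages) {
        // resolve the model's embedded template, then map it to the enum
        char tmpl_str[4096];
        if (llama_chat_get_model_template(model, nullptr, tmpl_str, sizeof(tmpl_str)) < 0) {
            return ""; // no template found
        }
        const llama_chat_template tmpl = llama_chat_get_template_type(tmpl_str);

        std::string prompt;
        const char * prev_role = nullptr;
        for (const auto & msg : messages) {
            char prefix[256];
            char postfix[256];
            llama_chat_get_prefix (tmpl, msg.role, prev_role, prefix,  sizeof(prefix));
            llama_chat_get_postfix(tmpl, msg.role, prev_role, postfix, sizeof(postfix));
            prompt += prefix;
            prompt += msg.content;
            prompt += postfix;
            prev_role = msg.role;
        }
        return prompt;
    }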

ngxson avatar Apr 22 '24 05:04 ngxson

@teleprint-me @hanishkvc I finally made this work. The code is not super clean, but I paid more attention to the API design, as it's what we need to bring chat template support to main. Feel free to let me know if anything (at the API design level) can be improved.

ngxson avatar Apr 22 '24 06:04 ngxson

Before moving further, @ggerganov could you please take a look at the API design to see if that's OK for you? Thanks.

ngxson avatar Apr 22 '24 07:04 ngxson

The API seems OK. If you think this is the right way, let's do it

ggerganov avatar Apr 22 '24 13:04 ggerganov

Given the context and circumstances, I think it's a start.

I can absolutely see this getting out of control, though, as I've previously stated, if not done with caution and forethought. This is going to be challenging to manage in the long term simply because there is no way to know or predict what templates will arise, or become preferred, over time.

Overall, I think it's okay as well. It obviously needs work, as that's how all things start. Hopefully we can identify a pattern and then determine how to smooth things out over time. It's better than nothing for now.

teleprint-me avatar Apr 22 '24 18:04 teleprint-me

I agree this can easily get over-engineered. I don't have the capacity atm to think deeply about this, so we should try to take into account feedback from people using chat templates, and at the same time not try to support all sorts of edge cases that one can think of. Just aim for the stuff that is used most of the time and makes sense. And try to keep the API and implementation separated from the rest of the functionality as much as possible, so that it can be easily adapted / replaced in the future if necessary.

ggerganov avatar Apr 22 '24 19:04 ggerganov

@ngxson do have a look at the new PR

https://github.com/ggerganov/llama.cpp/pull/6834

which I have uploaded. It uses a simple json to load the expected/supported handshake-template, as well as a flag to control whether any BoS is prefixed when a user message immediately follows the system message. In turn, the chat-template-apply which I have added in common/chaton.hpp handles the same, to try to provide the required flow in a simple, generic way.

Also, the json which should work with some of the models is in examples/chaton_meta.json.

NOTE: Among these, the 1 or 2 models which require avoiding special tags between the system message and the 1st user message seem to treat BoS + RoleTagPrefix as a single bunch and expect both to be treated the same way. However, some other models may require BoS to be handled specially while RoleTagPrefix is always handled the same way; for that, in my logic I will have to add a separate Begin/BoS entry in addition to the Prefix entry, and in turn do that selective inserting for Begin.

hanishkvc avatar Apr 23 '24 08:04 hanishkvc

I have updated my PR with support for separate Begin (BoS) and Prefix (RoleIdTag) entries wrt the User role. And in the json, one can individually control whether either of them gets prepended to the 1st user message following the system message. Monarch seems to need this based on your server-related chat-apply-template, and the same is supported now. Looking at the entries for Llama2, Monarch and Llama3, one can see how to configure the entries in the json file to achieve the 3 different possibilities wrt these 3 models.

hanishkvc avatar Apr 23 '24 10:04 hanishkvc

@teleprint-me @ggerganov Thanks for your feedback. I understand that this part can easily get complicated in the future, so these things were considered when I made this proposal:

  1. It allows using multiple chat templates (introduced in #6588)
  2. Prefix/postfix and content can be tokenized separately. This mitigates the risk of injecting special tokens into message content. While there's no API currently using this logic, we can easily add one in the future (see the sketch after this list).
  3. Since each chat template now has its own enum value, users can extend their logic using the value returned by llama_chat_get_template_type. No arbitrary templates are allowed (users must write their own logic if they want that)
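
As an illustration of point 2, here is a rough sketch (mine, not an API in this PR) of how the prefix/postfix could be tokenized with special tokens enabled while the user content is tokenized with them disabled; it assumes llama_tokenize keeps its current signature, and error handling is omitted:

    #include <cstring>
    #include <string>
    #include <vector>

    // Hypothetical helper: template parts may contain special tokens, user content must not.
    static void tokenize_chat_message(const struct llama_model * model,
                                      const llama_chat_template  tmpl,
                                      const char * role, const char * prev_role,
                                      const std::string & content,
                                      std::vector<llama_token> & out) {
        char prefix[256];
        char postfix[256];
        llama_chat_get_prefix (tmpl, role, prev_role, prefix,  sizeof(prefix));
        llama_chat_get_postfix(tmpl, role, prev_role, postfix, sizeof(postfix));

        auto append = [&](const char * text, bool parse_special) {
            std::vector<llama_token> tmp(strlen(text) + 16);
            const int32_t n = llama_tokenize(model, text, (int32_t) strlen(text),
                                             tmp.data(), (int32_t) tmp.size(),
                                             /*add_special*/ false, parse_special);
            if (n > 0) {
                out.insert(out.end(), tmp.begin(), tmp.begin() + n);
            }
        };

        append(prefix,          true);  // special tokens parsed in the prefix
        append(content.c_str(), false); // never parsed inside message content
        append(postfix,         true);  // special tokens parsed in the postfix
    }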

The only downside is that the code is no longer linear. That means adding a new template now requires a bit of "brain gym" to convert from Jinja to prefix/postfix. Still, it is better than tricking llama_chat_apply_template into outputting the correct thing (as demoed in #6810)
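
For example (my own illustration, not code from this PR), a ChatML-style template would reduce to roughly this per-role mapping:

    // ChatML-style mapping (illustrative): a "user" message "Hi" becomes "<|im_start|>user\nHi<|im_end|>\n"
    static std::string chatml_prefix (const std::string & role) { return "<|im_start|>" + role + "\n"; }
    static std::string chatml_postfix(const std::string & /*role*/) { return "<|im_end|>\n"; }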

Edit: Please also note that across the multiple issues on the subject of chat templates, I've seen many proposals for having prefix/postfix based on role. This PR will be the first one to bring that idea into the core API.

ngxson avatar Apr 24 '24 13:04 ngxson

@ggerganov @phymbert I got a weird issue on the CI workflow where the master branch gets merged automatically into the code on CI. Do you have any clue about that? Thanks. https://github.com/ggerganov/llama.cpp/actions/runs/8817445642/job/24203848022?pr=6822

ngxson avatar Apr 24 '24 16:04 ngxson

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 204 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=23519.97ms p(95)=44773.36ms fails=, finish reason: stop=82 truncated=122
  • Prompt processing (pp): avg=277.25tk/s p(95)=809.3tk/s
  • Token generation (tg): avg=19.06tk/s p(95)=25.73tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=xsn/chat_template_prefix_postfix commit=476d319fde0ae6c6a2ed9cfe54e548ad812fe5a5

prompt_tokens_seconds
[chart: llamacpp:prompt_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 204 iterations]

predicted_tokens_seconds
[chart: llamacpp:predicted_tokens_seconds — same run]

kv_cache_usage_ratio
[chart: llamacpp:kv_cache_usage_ratio — same run]

requests_processing
[chart: llamacpp:requests_processing — same run]

github-actions[bot] avatar Apr 24 '24 17:04 github-actions[bot]

I'm changing this PR to "demo" since I'm still not very confident about making the chat template system more complicated. Maybe we will re-visit this in the future. This PR is mostly useful for adding chat templates to main.cpp, but atm it's not a priority.

ngxson avatar May 04 '24 08:05 ngxson