llama.cpp
Generic chat templating code with a text/json file based config; main chat updated to drive its in-prefix, in-suffix and reverse-prompt from the same; a chat-apply-template equivalent c-api to allow use by other code also
*** Updated to match latest commit ***
Overview
Helps chat with models by tagging chat messages based on the specified chat-handshake-template-standard. This uses generic tagging code driven by a json metadata file, which specifies the handshake template details.
This can be used by
- main, to build on the existing interactive flow and its in-prefix, in-suffix and antiprompt/reverse-prompt
- server, by replacing its existing llama_chat_apply_template with the equivalent helper here.
The common pattern
As a convention, the tagging used by LLMs to differentiate between the different parts when chatting with them normally follows a general pattern of
- <BeginOfSentenceIfAny> <RolePrefixIfAny> <TheContent> <RoleSuffixIfAny> <EndOfSentenceIfAny>
- The roles could include System, User and Assistant (ie the Model)
- A chat normally consists of
  - a system message/prompt, followed by
  - multiple user message/query - model message/response pairs
- Different models will normally have all or some subset of the tagging mentioned above.
You may also notice some common patterns, like
- Because a user message is normally followed by a model/assistant response, in most models
  - user messages won't have an EndOfSentenceTag, and
  - the following model response won't have a BeginOfSentenceTag
- Because a system message will normally be immediately followed by a user query
  - in many models, there won't be an EndOfSentenceTag following the system message, nor a BeginOfSentenceTag on the 1st user message following the system message.
  - in some models, there won't even be a RoleSuffixTag following the system message, nor a RolePrefixTag on the 1st user message following the system message.
  - however, in many of these models, subsequent user messages will have the BeginOfSentenceTag and or RolePrefixTag.
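As a concrete instance of this pattern, the chatml standard (one of the standards covered later in this document) wraps a user message with only a role prefix and a role suffix, with no separate begin/end tags:

```
<|im_start|>user
Hello there<|im_end|>
```

Here the role prefix is `<|im_start|>user\n` and the role suffix is `<|im_end|>\n`.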
The Strategy
The template metadata json file allows the user to specify the above mentioned tags for each of the roles. Depending on whether a given model uses a given tag or not, you either specify the required tag or an empty string.
A tag could be a single word or multiple words, and may include newline chars specified using \n and so on. The tag is always demarcated using double quotes, which also allows spaces at the beginning or end of the tag, if needed.
To account for the conditionality of tags between the system message and the 1st user message, flags are provided to explicitly control whether each of these possible tags is used by a specific model or not, as part of its template info.
The roles are identified in the json file using "system", "user" and "assistant". However, the model may use different words to identify these roles, in which case set up RolePrefix and or RoleSuffix appropriately.
To identify that the model has finished generating its response to a user query, one will need to set the reverse-prompt to either the assistant's suffix or end tag, or to the user's begin or prefix tag, depending on what the model generates at the end of its response as per its handshake template standard.
The JSON File
It can contain the template info for multiple models/handshake-standards, and in turn each unique template is identified by a unique template id string.
The fields that make up a given chat-handshake-template-standard include
- global -> begin & end
- system -> begin, prefix, suffix & end
- user -> begin, prefix, suffix & end
- assistant -> begin, prefix, suffix & end
- reverse-prompt
- systemuser-system-has-suffix, systemuser-system-has-end, systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix
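As an illustration, a chatml entry in such a json file could look something like the following; the tag values follow the chatml convention, while the exact nesting/key layout shown here is only an assumption based on the field list above:

```json
{
    "chatml": {
        "global": { "begin": "", "end": "" },
        "system": { "begin": "", "prefix": "<|im_start|>system\n", "suffix": "<|im_end|>\n", "end": "" },
        "user": { "begin": "", "prefix": "<|im_start|>user\n", "suffix": "<|im_end|>\n", "end": "" },
        "assistant": { "begin": "", "prefix": "<|im_start|>assistant\n", "suffix": "<|im_end|>\n", "end": "" },
        "reverse-prompt": "<|im_start|>user",
        "systemuser-system-has-suffix": true,
        "systemuser-system-has-end": true,
        "systemuser-1st-user-has-begin": true,
        "systemuser-1st-user-has-prefix": true
    }
}
```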
Usage
One needs to load the json file containing the template metadata and then call the other helper functions as needed.
The helper functions can be used to either extract a given tag, to apply all tags specified for a given role to the passed message, or to apply tags to a bunch of messages in one go.
The individual message tagging helper will apply all tags specified for that role.
The multiple messages tagging helper, chaton-tmpl-apply, will look at the boolean flags when tagging the passed messages. Here the system suffix, system end, user begin and user prefix get included only if the corresponding flag is set.
Both the single and multi message tagging helpers come in two versions:
- one which returns a single string containing the tagged message(s)
- one which returns
  - [tagged msg] the string containing the tagged message(s)
  - [parts lengths] an array of integers specifying the part lengths, which divides the returned string into parts
  - [parts types] a string where each character indicates whether the corresponding part is a normal part, which needs to be tokenized without parse_special, or a special part, which needs to be tokenized with parse_special
example/main
The interactive commandline program under example/main uses
- the system role related tags to tag the system prompt
  - the system prompt includes the contents of -p if any
  - followed by the contents of the file specified using -f if any
- the user begin+prefix to map to in-prefix
- the user suffix+end to map to in-suffix
- the reverse-prompt to map to antiprompt
- wrt tokenization
  - the user specified system prompt is tokenized with the parse_special flag
  - however the user messages are tokenized without the parse_special flag
Currently main doesn't use chaton-tmpl-apply, but only
- chaton-tmpl-apply-single (for the system prompt), and
- chaton-tmpl-role-kv, which maps the user prefix, suffix and reverse-prompt to main's in-prefix, in-suffix and antiprompt. These always add any role specific begin+prefix and suffix+end around the passed message.
Adding support for new model / chat-handshake-template-standard
- Add suitable entries in the json for that model/standard
- Try to reuse the generic flow in chaton-tmpl-apply as much as possible before trying to add custom logic. If you update the generic flow, cross check whether existing json files will need to be updated or not.
Notes
Look at the sample chaton_meta.json in examples folder for how the above may apply
- llama2, llama3, gemma, chatml, zephyr, deepseek(normal and coder), monarch, mistral
This is interesting. The only issue I see with this is that it doesn't account for FIM (Fill-in-the-Middle). Other than that, it seems alright.
Something to note is that this, in practice, plays out a bit differently though and should be considered. For example, do we want to use only the file and/or the CLI options. I personally prefer simply using the file because it centralizes the template structure, exposes it to the API, and simplifies calling it.
There are always going to be injection risks, so maybe handle those separately. I'm just thinking out loud at the moment. Take this input with a grain of salt.
By fill in the middle, if you mean that the user message may have special-token related tags in it, which when tokenised would be treated as special tokens and could mess with things: if you look at the flow wrt main, the user message is tokenized without the parse_special flag.
However my generic chat-apply-template currently doesn't handle this, because it would require returning a vector of strings rather than a single string, as noted in the PR comment. If I am not wrong, that would differ from how others expect chat-apply-template to work, so I haven't decided on it, nor have I looked into other libraries' chat-apply-template in detail; I am guessing a bit here.
However if you mean something else, please do explain a bit, so I can see if I can do something about it. Do note that I am not a big user of the current crop of LLMs for various reasons, while I still do look at them once in a while to see where things are, so I am not that tuned in with the conventions / concept names.
I wanted a simple program with minimal inter-dependencies to use on my limited-resources machine, and I had some issues with ollama and llama3, so I just hacked this in with mostly guesswork and crude generalisation, by looking at the existing flow to some extent and at what I was seeing when I experimented on what I needed. I am hacking xyz without understanding abc, in some sense.
Or do you mean coding related models? I don't know if they have some fill-in-the-blank, or is it fill-in-the-middle, or some such phrase I may have previously seen wrt them; I don't remember now, and I haven't looked at them. If it is something like that you are talking about, I have to look at it.
Be it a general LLM or a coding related LLM, if you are talking about it filling some blanks in the middle of a statement the user has entered, then I assume the user will put some special tokens in the middle of their prompt, in which case the user message will have to be tokenized using parse_special. If that is what you are talking about, then maybe a cmdline argument can be added to inform the logic whether to treat the user message as normal text or as potentially including special-token related tags.
Or are you meaning coding related models and I dont know, if they have some fill-in-the-blank or is it fill-in-the-middle
Yes, this is what I meant. One of the models (that I know of) that's capable of infill is the Refact model. Sorry if I caused confusion or made assumptions.
Updated notes
Overview
Helps chat with a model, by allowing role based special token tagging, based on the specified chat-handshake-template-standard. This is used by main, to build on the existing interactive flow and its in-prefix, in-suffix and antiprompt/reverse-prompt.
- Use a json file to configure the needed tags for each of the supported chat-handshake-template-standards
  a. system -> prefix & suffix
  b. user -> prefix & suffix, assistant -> prefix
     - [main] these override the in-prefix and in-suffix
  c. reverse-prompt
     - [main] this adds to any reverse-prompt specified using the cmdline
  d. global -> begin & end
  e. systemuser-1st-user-has-prefix
     - [chaton-tmpl-apply] if a combination of system and user messages/prompts is passed, then for the 1st user message following the 1st system message, include the user prefix only if this flag is set.
     - [later] one or two models which I looked at seem to require not just the BoS, but also the user-role-prefix-tag to be controlled wrt this case, so I am not differentiating between the BoS and any user-role-prefix-tag. However, if the BoS and the user-role-prefix-tag need to be decoupled, where only the BoS needs this treatment, then maybe add begin and end keys (to specify the BoS) in addition to the prefix and suffix keys (to specify the user-role-prefix-tag) to the role blocks in the json, and in turn control only begin and not prefix wrt whether to add it or not.
- [main] currently the user specified system prompt (-p + -f) is tagged using the system role tags, and in turn this tagged message is tokenized with the parse_special flag. So any special-token related tags in the user specified system prompt will get parsed as special.
- chaton-tmpl-apply uses the loaded json file to decide how to generate the tagged messages for tokenisation.
  a. input: [ { role, message }, { role, message }, ... ]
  b. output: currently a single string is returned which contains the tagged message(s).
     - [later] if it is needed to differentiate between the special tags added by this and the user specified prompts/messages, then return [ { flag, data }, { flag, data }, ... ], where the flag specifies whether parse_special should be used or not for the corresponding data during tokenization.
Adding support for new model / chat-handshake-template-standard
- Add suitable entries in the json for that model/standard
- Update the flow in chaton-tmpl-apply, as needed. Try to update and or reuse the generic flow in chaton-tmpl-apply as much as possible, before trying to add custom logic. If you update the generic flow, cross check whether existing json files will need to be updated or not.
Notes
Currently main doesn't use chaton-tmpl-apply, but only
- chaton-tmpl-apply-single (for the system prompt), and
- chaton-tmpl-role-part, which maps the user prefix, suffix and reverse-prompt to main's in-prefix, in-suffix and antiprompt. These always add any role specific prefix and suffix around the passed message.
Sample chaton_meta.json includes template info for
- llama2
- llama3
- gemma
- chatml
- zephyr
- deepseek
I noticed some differences between deepseek's actual tokenizer config and what is there in llama.cpp's chat-apply-template, so for my logic I have added two entries: deepseek-alt (which matches the existing llama.cpp template) and deepseek (which matches the role related tags and eos from tokenizer_config.json). However, both will potentially work.
Later I need to cross check the tokenizer_config.json of the other models with what I have put in chaton_meta.json, to see if they are in sync. However, based on minimal testing of these models, the existing templates in chaton_meta.json do seem to work.
NOTE: Even if there is some difference in the EoS specified using reverse-prompt, chances are the default logic in main already looks for the EoS specified in the loaded model file too, so things should still be fine even if the json doesn't match the one in the model.
Or are you meaning coding related models and I dont know, if they have some fill-in-the-blank or is it fill-in-the-middle
Yes, this is what I meant. One of the models (that I know of) that's capable of infill is the Refact model. Sorry if I caused confusion or made assumptions.
I am in the middle of some things, but later I will try to look into this, as well as add a cmdline option to control whether the user prompt is parsed wrt special tokens or not.
I have added support for Begin and Prefix entries wrt the User role, and in turn one can configure each of them individually wrt whether it gets added to the 1st user message following the system message, from a chat-template-apply perspective, in common/chaton.hpp.
Look at the llama3, llama2 and monarch entries in examples/chaton_meta.json for how things can differ wrt begin and prefix, and in turn the 1st user msg following the system message.
At first glance, I'm not sure if it's a good idea to move the implementation completely into a separate JSON. While the good point is that it allows users to edit the list of templates easily, it brings some problems:
- This API is now outside of `llama.h` and can't be used in other examples
- It depends on `json.hpp`, which is again not part of `llama.h`
Also, could you do a test implementation with the examples in tests/test-chat-template.cpp? An automated test will make the PR easier to follow (and to verify if the idea works)
@ngxson for now I purposefully kept the new flow outside llama.h and within common/chaton.hpp, for these reasons
- As this is currently still an experiment to cross check that this mechanism can work in general across models, as well as wrt the web-service/server flow and a normal cmdline application flow (main), I didn't want to step on the current llama-chat-apply-template till this is validated.
- As I had mentioned, I have a fundamental issue with the current llama-chat-apply-template api, in that it merges the user prompt and the chat-handshake-template-standard tags/special-tokens into a single string, which in turn will be tokenized with the parse_special flag. This would allow a user to override/modify the system prompt or other aspects from under the normal handshake by inserting special tokens into the user prompt. Ideally it should be configurable, ie in some cases you may want this flexibility and in some cases you won't.
And in turn providing flexibility wrt (2) would either way require adding a new api wrt chat template applying, while potentially retaining the old api through a wrapper over the more flexible newer api.
As a possible step towards that more flexible flow, while experimenting towards the same, I have added the initial skeleton of a ChatParts class in common/chaton.hpp. (Note: I have been away from C++ for 1.5-2 decades++ now, and have jumped through too many languages at lower, similar or higher abstraction compared to C++ over the years, so my C++ memory is only so-so; I have depended more on the compiler not warning/erroring out, and have not strictly thought things through from a memory management perspective wrt the new classes, so there could be some inefficiencies and or gotchas in there.)
Also, if it makes sense to expose a more flexible api to differentiate between special-tokens++ parts and user provided parts in the formatted/tagged string, then the question is whether to expose it through the C only hop in llama.h by using
- a single string + an array of start-end indexes which relate the parts of the string to either special-tokens parts or the other parts
- or returning an array of strings plus an additional string where each char indicates how to treat each of the individual strings in the returned array
- or ...
Also, I remember earlier today somewhere reading about the possible deprecation of antiprompt/reverse-prompt. Rather, I feel the EoS/EoG tracking in main's logic should be built on top of antiprompt, in that the antiprompt vector should maintain a bunch of possible antiprompts, which can be filled from the EoS info in the model file itself, any commandline argument passed by the user, as well as potentially set from a chat-template manager like the chat-template-apply driven logic. The reason is that, if I am not wrong, some models may allow more than one chat-handshake-template-standard, in which case the model file may not explicitly provide all of the possible EoS/EoG token(s) across all of their supported standards. So retaining the antiprompt vector provides flexibility for multiple levels of intervening, like what I mentioned.
This was originally a weekend project, to solve an immediate issue I had at my end, and later to see if there can be a generic flow which can ideally be modified and or extended in future for models/standards which follow a sensible convention, without needing to modify code. The skeleton which I have added in chaton.hpp seems to provide that for the 5 to 6 models which I tested at a minimal level (ie a few back and forth handshakes using the main interactive flow augmented with my logic/PR) and the corresponding entries added to chaton_meta.json in my PR.
I glanced through test-chat-template.cpp, but I feel it currently uses a vector of chat templates from models without identifying the individual templates explicitly (eg through a map instead of a vector), thus requiring one to manually match each template against the chat-apply-template code to see which model/standard it maps to. I will see if I can create a duplicate file which uses this alternate chat-template-apply logic, after I have fleshed out ChatParts a bit more.
Also, as the json library seems to be co-opted into llama.cpp/common, I used the same and built my concept on top of it.
If the logic in this PR works out in handling most of the sensible models/standards out there using a generic flow, and in turn if there is interest in this PR, then maybe we can avoid json and replace it with a simple 1-level hierarchy text file, something like below, plus a simple parser for it.
Template-id1 \t key1: "value" \t key2: true|false \t user-prefix: "line 1 content \n line 2 ...\n line 3" \t user-begin: "value ..."
Template-Id2 \t key1: value4k1 \t key2: value4K2
....
I've just had a look in detail at this PR. The idea seems ok (i.e. using input_prefix/input_suffix/antiprompt), but I still find the implementation quite complicated IMO:
- I still prefer not to rely on JSON, since it makes the compiled binary not very portable
- Not sure how we can handle conditional prefix/postfix, for example llama2 with/without system message
- The code has some level of abstractions, for example `ChatParts`, that does not fit very well with the code style of llama.cpp. It makes me feel a bit like the logic is designed for higher-level languages like python
Also, I don't really understand the differences between this PR and #6822, as I'm also trying to implement a system of prefix/postfix for chat templates. Can you explain this a bit more?
Also I remember earlier today somewhere reading about possible deprecating of antiprompt/reverse-prompt, but rather I feel the EoS/EoG tracking by main's logic should be built on top of antiprompt in that antiprompt vector should maintain a bunch of possible antiprompts which can be filled from the EoS info in the model file itself
The problem is that all the new chat templates have moved away from antiprompt. They're all using special tokens to stop generation. This will still be true for all future models, so I don't think antiprompt is something that is future-proof (but special tokens are)
@ngxson hope the below gives some more background and info on the idea behind this PR
I've just had a look in detail at this PR. The idea seems ok (i.e. using input_prefix/input_suffix/antiprompt), but I still find the implementation quite complicated IMO:
- I still prefer not to rely on JSON, since it makes the compiled binary not very portable
Based on further experimentation, if it is found that a good number of chat-handshake-template-standards can be driven using a config file (json/...), then as I had mentioned in previous comment, we could look at a simple text file based config file instead of json, so that the code can be portable, without depending on a seperate json library.
- Not sure how we can handle conditional prefix/postfix, for example llama2 with/without system message
If you are talking about how, when there is a system+user message, one kind of tagging is required, while for a user only message a different kind of tagging is required: using the systemuser-1st-user-has-begin/prefix flags in the json file, I have tried to handle the difference in tagging across many models/standards. However, I agree that there may be a few more variations when looked at across multiple models/chats. I am looking at a more detailed (in terms of fields) json to see if more combinations can be covered, and that too without adding more custom flags. Maybe tomorrow I will give it a shot.
However, do note that if we are looking at pure main program based chatting, yesterday's simple json and corresponding logic already allow chatting with the around 5 to 6 models which I tested yesterday. However wrt the server/web-service related flow, I need to cross check with the more detailed json, because some more variations come into the picture.
- The code has some level of abstractions, for example `ChatParts`, that does not fit very well with the code style of llama.cpp. It makes me feel a bit like the logic is designed for higher-level languages like python
If you read my previous comments, as I had mentioned, ChatParts is more to help keep the different parts that make up a tagged message/chat separate, so that additional data can be extracted to tokenize in a more fine grained manner. At the same time, to allow exposing the api interface over a standard c-extern gating, instead of ChatParts its helpers can be used to expose the additional info using an array of chars and an array of ints. My commit today already has this mechanism implemented; do have a look to see what I mean.
You will see that people who want to follow the old api related flow of working with a single tagged string as-is can do that, while at the same time additional info is exposed, if they want to tokenize the user prompt parts differently from the tag parts.
Also I don't really understand the differences between this PR and #6822 , as I'm also trying to implement a system of prefix/postfix for chat templates. Can you explain this a bit more?
If I am not wrong, you are looking at implementing the prefix/postfix using hardcoded tags in the code, while this PR tries to see if the needed functionality can be achieved using a combination of a json/text based config file + code, the idea being to try to allow end users to manipulate tagging to some extent without needing to recompile things, as well as to try to allow new models/standards to be supported using a generic flow where possible.
ALERT: This is still an experiment; I need to cross check this a bit more before I can categorically say whether it can handle most common combinations or not.
Also I remember earlier today somewhere reading about possible deprecating of antiprompt/reverse-prompt, but rather I feel the EoS/EoG tracking by main's logic should be built on top of antiprompt in that antiprompt vector should maintain a bunch of possible antiprompts which can be filled from the EoS info in the model file itself
The problem is that all the new chat templates are moved away from antiprompt. They're all using special token to stop generation. This will still be true for all future models, so I don't think antiprompt is something that is future-proof (but special tokens are)
Rather, you seem to have looked at only a part of my comment. If you read the para fully, you will see the reason why I have suggested retaining the current flexible antiprompt mechanism and then adding the EoS from the model file into the antiprompt flow itself, ie by inserting the model's EoS info into the antiprompt vector.
By a more detailed json/text config file to try to support more combinations in parallel without too many flags, what I am thinking of (need to cross check) is
- global -> begin & end
- system -> begin, prefix, suffix & end
- user -> begin, prefix, suffix & end; assistant -> begin, prefix, suffix & end
  - [main] these override the in-prefix (begin+prefix) and in-suffix (suffix+end)
- reverse-prompt
  - [main] this adds to any reverse-prompt specified using the cmdline
- systemuser-sys-has-suffix, systemuser-sys-has-end, systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix
  - [chaton-tmpl-apply] if a combination of system and user messages/prompts is passed, then include the system suffix and end, as well as the begin and prefix of the 1st user message following the 1st system message, only if the corresponding flags are set.
- begin should normally relate to the BoS, while prefix should relate to the role identifier tag. If there is no need for separate handling of the BoS and the RoleIdTag, then one could even set both the BoS and the RoleIdTag in one of these entries itself.
- This is still just an initial idea in my mind from looking at a few jinja files; I need to think through and try out this detailed fields based flow still. However, the existing simpler json and the corresponding support added to drive main's in-prefix/suffix do work for main based chatting. It's the server/web-service kind of flow where this more detailed fields based flow needs to be thought through a bit more and cross checked.
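For example, with this layout a llama3-style user role could keep the BoS in begin and the role identifier tag in prefix; the tag values below follow the llama3 convention, while the flag values and exact key layout are purely illustrative:

```json
{
    "user": {
        "begin": "<|begin_of_text|>",
        "prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
        "suffix": "<|eot_id|>",
        "end": ""
    },
    "systemuser-1st-user-has-begin": false,
    "systemuser-1st-user-has-prefix": true
}
```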
Also, the idea is to try to see if a common generic logic can be used to drive templating for many models/standards, while still providing the flexibility to hardcode in code, if required, for specific models/standards.
@ngxson have a look at the latest commit here. Using a simple generic logic (which you can check out in the chaton_tmpl_apply_ex function) and a json file containing the details of the chat template in a simple and detailed way, this logic tries to allow tagging of the messages across different models/template standards.
For around 9 models/chat-handshake-template-standards I have included sample json config in examples/chaton_meta.json
- llama2, llama3, gemma, chatml, zephyr, deepseek (normal and coder), monarch, mistral
The c-api, which follows a similar semantic to the previous llama_chat_apply_template, is available in common/chaton.hpp.
As the models for which I have added sample template config info, and in turn checked using the modified main, differ a bit from those in test-chat-templates, I have added a new test-chat-template-chaton.cpp to the tests folder. If you run it, it will show the tagged messages wrt the 9 models mentioned above, so that you can check if what it generates is similar to what you may be expecting or not.
I feel this mechanism of a generic flow driven by a json is viable in general, based on the models which I have tested against. Either way, if a particular model requires a very different structure beyond what can be generated by the generic logic, one can always add custom code into chaton_tmpl_apply_ex. This should allow supporting new models in many cases by just adding to the json config file.
Also do go through the detailed notes/comments at the beginning of common/chaton.cpp to get a rough feel for this code.
I understand the high level idea but sorry I really don't have time to look at the detailed implementation. While it's a good idea, IMO the chat template infrastructure should be kept simple and support for customizable formats can be added later on. Maybe we can keep your PR as a demo and we will see if it can be merged in the future.
Also for context, there's already a discussion on chat templates in the beginning of server development. You can have a look here: https://github.com/ggerganov/llama.cpp/issues/4216#issuecomment-1829911139
Hi @ngxson, @ggerganov
generic code flow + config file based template handling
please do have a look at the implementation; the generic flow is actually very simple yet flexible, and I feel this idea can accommodate many different models/handshake standards by just updating the config file, without touching the code flow. (There could be some small whitespace differences in the worst case, which potentially may not matter beyond a limit; even those could be handled by adding generic flags like trim-content, but that may make the control unnecessarily fine-grained.)
I have tried to add support for 8(+1) different models/standards in examples/chaton_meta.json, all using the generic flow itself, without requiring any model-specific customization in code. At an initial glance the tagged messages seem ok to me, but it would be useful for someone else to cross-check to be sure.
To test with main, use
- bin/main -m path/to/llama3.gguf -i --chaton-meta-json ../examples/chaton_meta.json --chaton-template-id llama3 -p "You are a monster " -f ../prompts/chat-with-bob.txt
To test the server-style tagging of multiple messages at once, use
- bin/test-chat-template-chaton ../examples/chaton_meta.json
This PR's code is in common/chaton.hpp, and the generic logic which uses the config file to do the tagging is in the function chaton_tmpl_apply_ex. You will notice that the generic flow basically just builds on the basic pattern used by most models/standards in a simple and straightforward way, without much complexity.
It also provides the basic plumbing for differentiating between the user-provided parts and the template-provided parts in the tagged message, so that in future, if required, tokenisation can be controlled by using parse_special for the template-provided parts while avoiding parse_special for the end-user-entered parts (i.e. their queries during chatting). This is currently not exposed in the c api.
Because this config file + associated generic flow exposes all parts of the generic pattern for all 3 roles, anyone wanting to experiment with different templates for a given model can also do so by just updating the config file. The exception is fancy conditional insertion of tokens beyond the basic system + 1st-user-message handling seen in the 8(+1) models I have checked so far; in that case one would have to add custom code, as one would have done in the existing flow anyway. (This partly relates to a query/comment I noticed in PR #4216.)
simple text based config file
As you noted a concern that this config-file-based flow could force users of core llama.cpp to bring in a json library, I have added a simple text-based config file format, to try to avoid the json dependency while still giving sufficient flexibility for this use.
The code for the same is in common/simpcfg.hpp
and the sample simpcfg text based config file is in examples/chaton_meta.simpcfg
NOTE: currently I have not updated chaton.hpp to use these simpcfg-based files instead of json files. If you all find that the chaton generic flow is doing what is expected in a sufficiently proper way, and that it is better to avoid a json dependency for 3rd-party users of llama.cpp as a library, then I can look at replacing the json (picked from what was already in the common dir) with this simpcfg-based flow.
Note
Do have a look at the note in chaton.hpp for the up-to-date overall flow and reasoning. For now, the 1st note in this PR conversation has been updated to match the note in chaton.hpp.
Also I agree, let's not look at this from a merging angle yet; only after both of you (and any others with relevant knowledge) have gone through this flow and found it ok and flexible enough should we look at merging.
NOTE: I am a casual (non-regular) user of LLMs as well as llama.cpp, so I don't have much experience beyond the basics. But if this idea works out, as it currently seems to, then many future models/chat-handshake-template-standards that follow a sane generic pattern (as many seem to) could be supported by the generic flow itself, just by updating the config file, without needing to modify and recompile code. However I need eyes from experienced users and developers of llama.cpp like you to cross-check whether what I am seeing with my limited testing actually makes sense.
NOTE: If new models/standards follow a sane pattern, then other than updating the config file, the only code change that may be required is in the tokenizer, for any new special tokens they may have added or a different encoding of an existing special token tag, i.e. if there is no generic way to pick this info across models from their model file. This is a logical guess based on my limited knowledge of llama.cpp and llms in general.
Updates wrt
SimpCfg
- Provide logic for more proper trimming wrt more languages by converting to wstring, disabled by default
- switch to variant and templates based logic, to avoid duplication, as well as to allow for easier new data type additions if required in future.
- support for vectors/arrays of supported data types in a verbose/expanded out way
- ensure true/false bool parsing is case-insensitive when loading from the text-based simple config file
- strip enclosing double quotes around string values read from the config file
ChatOn
- add support for Phi3 model to chaton_meta.json
Have rebased and force pushed wrt latest upstream.
Had to resolve a conflict wrt updating the tests/CMakeLists.txt due to change of convention in CMakeLists.txt from upstream
Have added command-r model's chat template to chaton_meta.json
In case users of command-r model, @khimaros, https://github.com/abetlen/llama-cpp-python/pull/1382, are interested
Have added chat template info wrt Orion(Star), Vicuna and OpenChat into chaton_meta.json
@ngxson @ggerganov
Now I have added equivalent chat template config to examples/chaton_meta.json for all the models currently hardcoded in llama.cpp-chat-apply-template. Except for the last 3 to 4 models added this week, I have tested all of them using my modified main flow in this PR, and it seems fine.
This PR follows the logical pattern seen across models and implements it as a generic code flow driven by a text-based config file, which logically seems good enough to satisfy all the models currently supported by chat templates in llama.cpp.
And chances are that even for newer models this generic code will satisfy their needs by just updating chaton_meta.json, without any hardcoding or recompiling, unless a new model changes the pattern in some odd way, which doesn't seem to have been the case so far with the models that currently have hardcoded chat templates.
Also, if one compares the generic code flow in common/chaton.hpp with the currently hardcoded llama.cpp-chat-template-apply, I feel this new code flow is cleaner, equally simple to understand, and most probably shouldn't need changing for newer models in general.
Additionally, even though currently I have only patched main to use this generic-code-flow + text-config-file approach, I have also provided a c-api wrapper matching the existing chat-apply-template in chaton.hpp, so it should be easy to test this generic flow with any code which uses the default chat-apply-template api. The actual code flow is flexible enough to allow all of the below from the same config file, building on the same code blocks.
- querying for individual role's begin or prefix or suffix or end
- application of chat template for a single message or
- application of chat template for a bunch of messages.
thus allowing its use across main-based chat, server-based chat, or any other use case.
I feel you both should look at this once more, test it out, think through the simplification this PR makes possible for this use case, and then take a call if the test results are ok. However, given that you both are long-term developers and maintainers of llama.cpp, compared to the once-in-a-blue-moon user/developer that I am, the final call is yours to take.
Also, as previously mentioned, if you feel you don't want to burden users of the llama.cpp library with a json dependency, I have created a very simple yet flexible text-based config file format and its parser in common/simpcfg.hpp. If required, the code in chaton.hpp can easily be updated to work with this simple text-based config file, thus avoiding the need for json altogether.
Also note that the way multi-message chat templating is implemented in this logic, it keeps track of the boundaries between what the user passed during chatting and what the chat template logic added, so that in future, if one needs to apply different flags (like parse_special) to these two different sets of data in the tagged chat messages, the same can easily be supported.
Or do you mean coding-related models? I don't know whether they use fill-in-the-blank or fill-in-the-middle.
Yes, this is what I meant. One of the models (that I know of) that's capable of infill is the Refact model. Sorry if I caused confusion or made assumptions.
Hi @teleprint-me , Forgot to mention this before,
As the flow currently applies parse_special to user-added text/prompt as well, technically any special tokens in it should get handled appropriately, allowing fill-in by the models.
At the same time the underlying logic does keep track of the user provided text and the template logic added text, so that in future, if anyone wants to selectively disable special token parsing wrt user provided text, it can be achieved.
Rather, there are use cases which require special-token parsing of user text and use cases which don't, so the plumbing for both is there in this PR. The C++ api exposes both possibilities, while the c-api provided currently doesn't expose the differentiation of the tagged chat message's subparts, because the current default c-api for this in llama.cpp doesn't allow for it. However a new c-api wrapper for this can easily be added, because the c++ api has been developed keeping the need for such a future c-api wrapper in mind.
UPDATE1: Sorry, I think I now realise what you meant: chat-template-apply and main's chat-interactive flow have slightly different behaviour, which I forgot about in the middle of other things. Even though the c-api for template applying doesn't currently differentiate between subparts of the chat, the default chat-interactive flow of main doesn't enable special-token parsing for user text. I will add an option for controlling this in main's chat-interactive mode.
UPDATE2: I have created a separate pull request to allow special tokens in user text in interactive mode, so that it is independent of this PR's chat-templating flow changes.
https://github.com/ggerganov/llama.cpp/pull/7097
this seems like a great contribution! do i need any special flag incantation to enable the new behavior? should this work automatically for OpenAI clients (eg. BetterChatGPT) against the server binary built from your branch with default make vars?
Hi @khimaros, @ngxson, @ggerganov
Using wrt examples/main
Currently examples/main has been updated to use this code. To use this with main, one needs to pass
bin/main -m PATH/LLM.gguf --interactive --chaton-meta-json ../examples/chaton_meta.json --chaton-template-id THE_ID -p "a sample system prompt"
The template id of the model/chat-template standard can be picked from the template ids in the chaton_meta.json file, for example for Microsoft's phi3 it is phi3
Other usecases including examples/server
For more generic use cases, I have added a c-api wrapper for this logic which provides an api similar to the existing llama_chat_apply_template; it is chaton_template_apply_capi in common/chaton.hpp.
NOTE: One difference between the existing api and this one is that the new api requires passing the template-id rather than the chat-template jinja line from the model. The existing api allows passing either a template-id (I have kept the ids almost the same as the existing code, except for deepseek and vicuna; the reason is in the corresponding commit messages where their templates were added to chaton_meta.json) or the jinja line from the model, to help identify the model.
There was feedback from @ngxson (the existing chat-template author and, I assume, server author) that, because this new chat-templating flow
- uses json, and the llama.cpp project may not want to add a json dependency into the core chat templating path of llama.cpp (when used as a library), and
- adds more code in the form of this PR's generic code flow,
we shouldn't go ahead with this PR for now but may revisit it at a later date. So I haven't looked too much at the server currently.
That said, for examples/server, if one
- modifies the server code to initialize this new code and pass the template id rather than the jinja line from the model, and
- renames/comments out the existing llama_chat_apply_template c api and renames this new chaton_tmplt_apply_capi to llama_chat_apply_template,
then chances are one should be able to make the existing examples/server work with this new flow.
JSON - SimpCfg - Examples - llama.cpp library
Given the concern mentioned with JSON, I had implemented a simple text-based config file format in common/simpcfg.hpp to check whether that could resolve the concern, but the discussion hadn't progressed on this earlier.
However @ggerganov responded earlier today indicating that, given the examples already use json, there is potentially no need to look at a different, simpler text config format. Gerganov, when you mentioned this, were your thoughts specific to the examples, or is it more general and does it also apply when llama.cpp is used as a library by other (external) projects?
answering my own question: it seems this hasn't been incorporated into the server yet. seems there's another branch for that.
i'm testing out the command-r chaton configuration using the following incantation:
./main --temp 0.7 --repeat_penalty 1.1 --model ./models/c4ai-command-r-v01-Q6_K.gguf --ctx_size 4096 --threads 8 --n_predict 2048 --color --interactive --file /tmp/llamacpp_prompt.enm73Qj.txt --reverse-prompt USER: --in-prefix ' ' --chaton-meta-json ./examples/chaton_meta.json --chaton-template-id command-r
with the following in the prompt --file:
This is a conversation between USER and COMMANDR, a friendly chatbot. COMMANDR is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision
i have the following observations:
- previously, it was idiomatic to add `USER:` as the last line of the prompt string. this is no longer needed.
- it is no longer necessary to provide a reverse prompt of `USER:` nor an input prefix of `' '` in order to control the flow
- seeing all of the tokens in the chat log is a bit disorienting and less immersive than seeing `USER:` / `ASSISTANT:` as it is in master branch. i wonder what the tradeoffs would be of hiding the tokens from the user?
- i'm seeing spurious whitespace and `>` just before the input prefix token
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>This is a conversation between USER and COMMANDR, a friendly chatbot. COMMANDR is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.<|END_OF_TURN_TOKEN|>
> <|START_OF_TURN_TOKEN|><|USER_TOKEN|>
Gerganov, when you mentioned this, were your thoughts specific to examples or is it more generic and inturn also relates to when llama.cpp is used as a library by other (external) projects also
The json should be used only within the examples of llama.cpp, not by the actual library. I.e. the source file llama.cpp should not include json.hpp
Hi @khimaros,
This patch auto-sets example/main's in-prefix/suffix as well as the antiprompt/reverse-prompt from the equivalent configuration data in the specified chaton_meta.json file; that is why they no longer need to be explicitly specified.
The extra "\n> " you are seeing is the only-visible-to-the-end-user prompt added by the existing main code; as I reuse/extend the existing main flow, you see the same.
Hi @ggerganov @ngxson @teleprint-me @khimaros @mofosyne
The initial/previous version was rooted in a json object, while the new version is rooted in a MapOfMapOfVariant (GroupKV), which can be preloaded with chat-template info at compile time itself and used as is. Optionally, one can allow the configurable template data to be extended/updated at runtime from a text(/SimpCfg)/json file.
Thus this new flow should allow using the new chat templating logic without loading additional data at runtime if one doesn't want to, thereby also avoiding the need to bring in the common/json library.
At the same time, for a use case like examples/main, where it is useful to let the user change the existing (pre/compiled-in) template info and/or try adding support for new models/finetunes/template-standards, the same can be achieved by loading it from a json file.
Optionally, in use cases where one wants the runtime-augmenting capability but still doesn't want to bring in common/json, one could switch ChatTemplates to use SimpCfg (which builds on GroupKV) and use its load logic to load from a simple text file.
The Notes in common/chaton.hpp have been updated to capture the new mechanism.
Currently, by default, CHATON_JSON (which brings in json-based loading) as well as GKV_DEBUGLOG_ON (which makes the logic more log-verbose) are enabled; both need to be disabled. Rather, as I was writing this, it occurred to me that I need to move the CHATON_JSON block into its own file, so that the library can by default be compiled without needing json, and only programs which use it, like main, can include the new file with this json-based loading helper.
NOTE: The compile-time pre/compiled-in configurable template data is picked from chaton_meta.hpp. There is a simple, stupid-minded python helper added to scripts to convert from chaton_meta.json to chaton_meta.hpp.
NOTE: Currently I have not updated the code to follow some of the naming/coding conventions mentioned.
Hi @ggerganov @ngxson @mofosyne
Just to give rough context: for code using the existing chat template logic, like examples/server, a simple change like the below will allow it to use the new chat template logic from this PR, once the code+setup is moved to live in the llama(.cpp) library rather than the common library, along with specifying/passing the template-id rather than the jinja template string for the template argument (I have hardcoded it to llama3 below).
I have done a crude transplanting of chaton into llama.cpp in the below repo, in case anyone wants to test it. Note that this doesn't integrate chaton into the llama library in a proper way, nor does it take a cmdline argument for the template-id etc.
https://github.com/hanishkvc/experiment-ai-tools-llama.cpp/tree/hkvc_chaton_v3_crude_server_v1
```diff
diff --git a/llama.cpp b/llama.cpp
index 7d26966e..a72da101 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -17983,6 +17983,8 @@ static int32_t llama_chat_apply_template_internal(
     return dest.size();
 }

+#include <chaton.hpp>
+
 LLAMA_API int32_t llama_chat_apply_template(
     const struct llama_model * model,
     const char * tmpl,
@@ -18014,7 +18016,8 @@ LLAMA_API int32_t llama_chat_apply_template(
     }

     std::string formatted_chat;
-    int32_t res = llama_chat_apply_template_internal(curr_tmpl, chat_vec, formatted_chat, add_ass);
+    //int32_t res = llama_chat_apply_template_internal(curr_tmpl, chat_vec, formatted_chat, add_ass);
+    int32_t res = chaton_tmpl_apply("llama3", chat_vec, add_ass, true, formatted_chat);
     if (res < 0) {
         return res;
     }
```