mlx-examples
chore(mlx-lm): enable applying the default chat template
The LLM community seems to be moving towards the standard chat template, with many models starting to use the default chat template. Enabling the default chat template for mlx-lm may be more beneficial. Additionally, users can always manually disable the chat template using the --ignore-chat-template flag.
@awni, please feel free to close this PR if you feel it does not fit the purpose.
Yeah, the chat template will fall back to the default chat template. The reason for originally checking the chat template was to avoid using the default, because it may generate unexpected output. FYI: https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/tokenization_utils_base.py#L1797 https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/tokenization_utils_base.py#L1736-L1740
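Roughly, the fallback being referenced works like this (a simplified sketch, not the actual transformers code; see the links above for the real v4.38.2 implementation):

```python
# Simplified sketch of the v4.38 fallback (not the actual transformers source): an
# explicitly set chat_template wins, otherwise the class-level default is used.
def resolve_chat_template(tokenizer, chat_template=None):
    if chat_template is not None:
        # Template passed explicitly by the caller.
        return chat_template
    if getattr(tokenizer, "chat_template", None) is not None:
        # Template set in tokenizer_config.json or in the tokenizer implementation.
        return tokenizer.chat_template
    # Fallback: class-level default (ChatML for the base class), which may not match
    # what the model was actually trained on -- hence the original caution.
    return tokenizer.default_chat_template
```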
Thanks!
Indeed you have a point, and I see we are all converging to a de facto template.
However, what happens when it's a pretrained model without any instruction tuning? Won't defaulting (falling back) to ChatML produce bad results?
In that case, you may need to use the --ignore-chat-template flag, which makes sense given there are more use cases for instruction-tuned models than for base models.
Makes sense to me :)
As an ML engineer, this is intuitive to me.
I'm just thinking about UX for users at large who might not understand the distinction.
They could just run generate and have the abstraction figure out the template formatting, if any.
Just curious, what cases does this fix that currently do not work? Don't the instruction tuned models have the template in the tokenizer? This would fix the commandR issue for example where it did not include the chat template?
> Just curious, what cases does this fix that currently do not work? Don't the instruction tuned models have the template in the tokenizer? This would fix the commandR issue for example where it did not include the chat template?
Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
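Roughly, the check in question looks like this (paraphrased from mlx_lm.generate; details may differ from the actual source):

```python
def build_prompt(tokenizer, prompt, ignore_chat_template=False):
    # Paraphrase of the current gating: the template is applied only when the tokenizer
    # explicitly defines chat_template, so class-level defaults are never used.
    if (
        not ignore_chat_template
        and hasattr(tokenizer, "apply_chat_template")
        and tokenizer.chat_template is not None
    ):
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    return prompt
```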
> Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
Oh, I get it now. That’s the behaviour I was expecting.
Could you point me to a model with such a configuration? I would like to check it out.
> Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
> Oh, I get it now. That's the behaviour I was expecting.
> Could you point me to a model with such a configuration? I would like to check it out.
Yeah. For example, this one sets the chat template explicitly in the tokenizer configuration:
https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-6.7B/blob/main/tokenizer_config.json#L181
For the c4ai-command-r-v01 model, the tokenizer only defined the default chat template: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenization_cohere_fast.py#L193
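A quick way to see the difference between the two cases (assuming a transformers version around v4.38, where default_chat_template still exists):

```python
from transformers import AutoTokenizer

# OpenCodeInterpreter sets chat_template explicitly in tokenizer_config.json,
# so the current check in mlx_lm.generate applies it.
tok = AutoTokenizer.from_pretrained("m-a-p/OpenCodeInterpreter-DS-6.7B")
print(tok.chat_template is not None)  # True

# c4ai-command-r-v01 only overrides default_chat_template in its tokenizer class,
# so chat_template itself is None and the current check skips templating entirely.
```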
@mzbac Got it,
It works with both pre-trained and instruction models. 👍🏽
But I noticed that with starcoder2-3b it gives inconsistent results.
python -m mlx_lm.generate --model mlx-community/starcoder2-3b-4bit --prompt "Write a addition function in python" --colorize --max-tokens 10
Current implementation:
==========
Prompt: Write a addition function in python
which takes 2 numbers as input and returns the
==========
With your change:
==========
Prompt: Write a addition function in python<|endoftext|>
<fim_prefix><fim_suffix>ue_comment>username_0:
==========
Additionally, with your change the tokenizer for some reason adds an eos token to the prompt.
Found the issue: without the condition checking that tokenizer.chat_template is not None, the condition becomes True, which triggers this warning:
No chat template is defined for this tokenizer - using the default template for the GPT2TokenizerFast class. If the default is not appropriate for your model, please set tokenizer.chat_template to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
And that's how you get the eos token.
==========
Prompt: Write a addition function in python<|endoftext|>
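For reference, the same behaviour can be reproduced straight from the tokenizer (assuming a transformers version around v4.38 that still ships class-level default templates):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/starcoder2-3b-4bit")
messages = [{"role": "user", "content": "Write a addition function in python"}]

# With no explicit chat_template, apply_chat_template warns and falls back to the class
# default, which appends eos_token after each message -- hence the trailing <|endoftext|>.
print(tok.apply_chat_template(messages, tokenize=False))
```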
@Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
> @Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
It may not really be an issue, because Starcoder is not a chat model. Users have to disable the chat template by using --ignore-chat-template.
🤔 so if I understand correctly, this PR will improve the default setting for some models (by using the default chat template) but for other models it might be worse because the default chat template has issues.
I'm not really sure how to decide whether to land it or not. So far it seems a bit of an edge case that a model expects to use the default template but doesn't have it set, is that fair to say? Maybe the right way to do this is to have a --use-default-template flag or something like that?
I might be wrong on that assessment; I think we should make the default the more common case, though. Any input on that would be useful.
@awni, yes, it used to be an edge case where the model expected to use the default template. However, recently models seem to be moving towards using the default chat template. The HF tokenizer.apply_chat_template is showing a warning when applying the default template, so users should be aware that the default template is being used and they can manually disable it. I am happy to close this PR for now if we still feel that this is an edge case. Maybe we can wait for a while to see how the community moves, and then decide which is the most common use case.
> The HF tokenizer.apply_chat_template is showing a warning when applying the default template
Nice, that's useful to know!
> I am happy to close this PR for now if we still feel that this is an edge case. Maybe we can wait for a while to see how the community moves, and then decide which is the most common use case.
I would suggest we do one of two things:
1. Give an option to use the default if desired (right now it's not possible if the template is None).
2. Keep it this way and keep the option to shut off the default template if desired.
I really couldn't say which is a better choice. I have a slight preference to merge what you have, see how it goes, and reassess if it becomes an issue. It seems like we are providing useful output even when the default is wrong. But we could also switch to option 1 and do the same. Wdyt?
Personally, I feel that choosing option 1 may be better since it is more explicit. If you don't mind, I can update the PR to go with option 1.
Of course, sounds good to me! Thank you!
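For reference, option 1 could look roughly like this (the flag name and wiring are illustrative, not necessarily what the PR will end up with):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-default-chat-template",
    action="store_true",
    help="Apply the tokenizer's class-level default template when chat_template is unset",
)
args = parser.parse_args()

# Later, before templating (tokenizer loaded elsewhere):
# if args.use_default_chat_template and tokenizer.chat_template is None:
#     tokenizer.chat_template = tokenizer.default_chat_template
```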
> @Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
I was unwell, but I'm back.
@mzbac's last commit (df1eb23) addresses the issue I reported.
@awni and @mzbac I would like to point out that chat_template is the new standard and default_chat_template exists only for backwards compatibility. It's recommended to override the default_chat_template by setting chat_template, and not the other way around. According to HF, default_chat_template is likely to be dropped in favor of chat_template [1].
I could be wrong, but IMO we should be patching the tokenizers on the MLX-community hub, and those of future converted models, by setting the chat_template for the ones relying on the default, instead of using the flag during inference.
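Something along these lines (a sketch only; the repo id is a placeholder, not a real model):

```python
from transformers import AutoTokenizer

# Sketch of the patching idea: copy the fallback into chat_template once and push it,
# so the template travels with the converted model like any explicitly-set template.
tok = AutoTokenizer.from_pretrained("mlx-community/some-converted-model")  # placeholder
if tok.chat_template is None:
    tok.chat_template = tok.default_chat_template  # or the model's actual prompt format
    tok.push_to_hub("mlx-community/some-converted-model")  # requires write access
```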
Here is a summary from the HF article discussing templates [1]:
Prior to chat templates, chat handling was hardcoded per model class. For backwards compatibility, these class-specific handlers are now default class-level templates. If a model lacks a chat template but its class has a default template, the default template is used instead.
However, explicitly setting chat_template is recommended, even when the class default is suitable, to clarify chat configuration and future-proof against potential default template changes or deprecation.
You can read more here:
[1] Templates for Chat Models
- https://huggingface.co/docs/transformers/en/chat_templating#advanced-how-do-chat-templates-work
- https://huggingface.co/docs/transformers/en/chat_templating#what-are-default-templates
The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models. This really depends on the community adoption of the chat_template. We may just have to wait and see.
> The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models.
Could you elaborate? I meant patching tokenizers already in the MLX-community hub only and any new mlx-lm supported model.
> This really depends on the community adoption of the chat_template. We may just have to wait and see.
Agreed!
> The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models.
> Could you elaborate? I meant patching tokenizers already in the MLX-community hub only and any new mlx-lm supported model.
> This really depends on the community adoption of the chat_template. We may just have to wait and see.
> Agreed!
Yeah, mlx-lm can directly load HF models. So if the original model has not been patched and is loaded via mlx-lm expecting the default chat template, it may behave differently from the model we patched in mlx-community.
> Yeah, mlx-lm can directly load HF models. So if the original model has not been patched and is loaded via mlx-lm expecting the default chat template, it may behave differently from the model we patched in mlx-community.
I see your point!
At the moment, the default behavior for those models is to skip the templates and pass the prompt directly, right?
If so, are there any examples of models that behave strangely with the current setup?
TL;DR: We don't need this change, but if we decide to add it, we'd better add:
- A warning that default_chat_template might produce wrong results
- A condition to check if default_chat_template exists, because it might be deprecated in the future.
I ran some tests, and even non-chat models have default templates; setting this flag causes the same issues I stated before.
Starcoder2-3b
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mlx-community/starcoder2-3b-4bit")
print(tokenizer.default_chat_template)
# {% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}
```
And using the default_chat_template results in the following output ❌:
No chat template is defined for this tokenizer - using the default template for the GPTNeoXTokenizerFast class. If the default is not appropriate for your model, please set tokenizer.chat_template to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
==========
Prompt: Write a addition function in python<|endoftext|>
<fim_prefix><fim_suffix>ue_comment>username_0:
==========
Compared to passing the prompt directly without any template ✅ :
==========
Prompt: Write a addition function in python
which takes 2 numbers as input and returns the
==========
Same thing happened with gemma-2b and stable-code-3b.
Every chat/instruction model I tested has the chat_template present and the default_chat_template is identical. This holds true for most if not all chat/instruction models.
> TL;DR: We don't need this change, but if we decide to add it, we'd better add:
> - A warning that default_chat_template might produce wrong results
> - A condition to check if default_chat_template exists, because it might be deprecated in the future.
I am not sure we don't need it. Just as an example: if I want to load a model that is expected to use the default chat template, we currently cannot apply a chat template for that model, because there is no way to apply the default chat template if the model's tokenizer chat template is None.
For point 1, we now need to manually enable the flag to apply the default chat template; the warning may seem a bit verbose. As for point 2, even if huggingface decides to deprecate the method, it doesn't necessarily mean it will be removed from the base class as that could lead to breaking changes. We can address this once huggingface actually starts deprecating the method.
> I am not sure we don't need it. Just as an example: if I want to load a model that is expected to use the default chat template, we currently cannot apply a chat template for that model, because there is no way to apply the default chat template if the model's tokenizer chat template is None.
I understood that from the start. So far, I have only seen this issue with Command-R, but after talking to the team it became clear it was a mistake on their part during the release, which I fixed on our repo by patching the tokenizer chat_template.
How many times have you encountered this issue?
And could you point me to some example chat/instruct models that also have this issue?
> How many times have you encountered this issue?
> And could you point me to some example chat/instruct models that also have this issue?
I couldn't recall all the default chat template use cases, but since it exists in the transformers tokenizer, I assume there are a lot of instances of it. So we can either ignore those as edge cases or provide a simple way to apply the default chat template. I don't think we are in a position to push community adoption of the chat_template at the moment.
Chat_template is the standard now.
Check the majority of SFT chat/instruct models. All of them have it set.
Any base model fine-tuned for chat/instruction using TRL, Axolotl, LLaMA Factory, and so on has it set too.
@Blaizzy just to understand your concern a bit. This PR has changed to give another option to the user to basically force use the default template if the model doesn't have a chat template.
Are you concerned that it's a superfluous argument? It seems pretty benign otherwise, no?
Not to argue about whether chat_template should be the standard; just a quick example of a recent instruction model which doesn't use chat_template: OpenCodeInterpreter-DS-33B -> https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B/blob/main/tokenizer_config.json.