mlx-examples
chore(mlx-lm): enable applying the default chat template
The LLM community seems to be moving towards the standard chat template, with many models starting to use the default chat template. Enabling the default chat template for mlx-lm may be more beneficial. Additionally, users can always manually disable the chat template using the --ignore-chat-template flag.
@awni, please feel free to close this PR if you feel it does not fit the purpose.
Yeah, the chat template will fall back to the default chat template. The reason for originally checking the chat template was to avoid using the default, because it may generate unexpected output. FYI: https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/tokenization_utils_base.py#L1797 https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/tokenization_utils_base.py#L1736-L1740
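Roughly, the fallback being referenced works like this (a simplified sketch, not the actual transformers code; see the links above for the real v4.38.2 implementation):

```python
# Simplified sketch of the v4.38 fallback (not the actual transformers source): an
# explicitly set chat_template wins, otherwise the class-level default is used.
def resolve_chat_template(tokenizer, chat_template=None):
    if chat_template is not None:
        # Template passed explicitly by the caller.
        return chat_template
    if getattr(tokenizer, "chat_template", None) is not None:
        # Template set in tokenizer_config.json or in the tokenizer implementation.
        return tokenizer.chat_template
    # Fallback: class-level default (ChatML for the base class), which may not match
    # what the model was actually trained on -- hence the original caution.
    return tokenizer.default_chat_template
```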
Thanks!
Indeed you have a point, and I see we are all converging to a de facto template.
However, what happens when it's a pretrained model without any instruction tuning? Won't defaulting (falling back) to ChatML produce bad results?
In that case, you may need to use the --ignore-chat-template flag, which makes sense given there are more use cases for instruction-tuned models than for base models.
Makes sense to me :)
As an ML engineer, this is intuitive to me.
I'm just thinking about UX for users at large who might not understand the distinction.
They could just run generate and have the abstraction figure out the template formatting, if any.
Just curious, what cases does this fix that currently do not work? Don't the instruction tuned models have the template in the tokenizer? This would fix the commandR issue for example where it did not include the chat template?
> Just curious, what cases does this fix that currently do not work? Don't the instruction tuned models have the template in the tokenizer? This would fix the commandR issue for example where it did not include the chat template?
Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
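Roughly, the check in question looks like this (paraphrased from mlx_lm.generate; details may differ from the actual source):

```python
def build_prompt(tokenizer, prompt, ignore_chat_template=False):
    # Paraphrase of the current gating: the template is applied only when the tokenizer
    # explicitly defines chat_template, so class-level defaults are never used.
    if (
        not ignore_chat_template
        and hasattr(tokenizer, "apply_chat_template")
        and tokenizer.chat_template is not None
    ):
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    return prompt
```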
> Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
Oh, I get it now. That’s the behaviour I was expecting.
Could you point me to a model with such a configuration? I would like to check it out.
> Currently, the implementation only applies the chat template when the tokenizer has an explicitly specified chat template in the tokenizer configuration or implementation. Recently released models tend to use the default chat template (ChatML format). In that case, mlx_lm.generate won't apply the chat template.
> Oh, I get it now. That's the behaviour I was expecting.
> Could you point me to a model with such a configuration? I would like to check it out.
Yeah. For example, this one sets the chat template explicitly in the tokenizer configuration:
https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-6.7B/blob/main/tokenizer_config.json#L181
For the c4ai-command-r-v01 model, the tokenizer only defined the default chat template: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenization_cohere_fast.py#L193
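A quick way to see the difference between the two cases (assuming a transformers version around v4.38, where default_chat_template still exists):

```python
from transformers import AutoTokenizer

# OpenCodeInterpreter sets chat_template explicitly in tokenizer_config.json,
# so the current check in mlx_lm.generate applies it.
tok = AutoTokenizer.from_pretrained("m-a-p/OpenCodeInterpreter-DS-6.7B")
print(tok.chat_template is not None)  # True

# c4ai-command-r-v01 only overrides default_chat_template in its tokenizer class,
# so chat_template itself is None and the current check skips templating entirely.
```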
@mzbac Got it,
It works with both pre-trained and instruction models. 👍🏽
But I noticed that with starcoder2-3b it gives inconsistent results.
python -m mlx_lm.generate --model mlx-community/starcoder2-3b-4bit --prompt "Write a addition function in python" --colorize --max-tokens 10
Current implementation:
==========
Prompt: Write a addition function in python
which takes 2 numbers as input and returns the
==========
With your change:
==========
Prompt: Write a addition function in python<|endoftext|>
<fim_prefix><fim_suffix>ue_comment>username_0:
==========
Additionally, with your change the tokenizer for some reason adds an eos token to the prompt.
Found the issue: without the condition checking that tokenizer.chat_template is not None, the condition becomes True, which triggers this warning:
No chat template is defined for this tokenizer - using the default template for the GPT2TokenizerFast class. If the default is not appropriate for your model, please set tokenizer.chat_template to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
And that's how you get the eos token.
==========
Prompt: Write a addition function in python<|endoftext|>
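For reference, the same behaviour can be reproduced straight from the tokenizer (assuming a transformers version around v4.38 that still ships class-level default templates):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/starcoder2-3b-4bit")
messages = [{"role": "user", "content": "Write a addition function in python"}]

# With no explicit chat_template, apply_chat_template warns and falls back to the class
# default, which appends eos_token after each message -- hence the trailing <|endoftext|>.
print(tok.apply_chat_template(messages, tokenize=False))
```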
@Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
> @Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
It may not really be an issue, because Starcoder is not a chat model. Users have to disable the chat template by using --ignore-chat-template.
🤔 so if I understand correctly, this PR will improve the default setting for some models (by using the default chat template) but for other models it might be worse because the default chat template has issues.
I'm not really sure how to decide whether to land it or not. So far it seems a bit of an edge case that a model expects to use the default template but doesn't have it set, is that fair to say? Maybe the right way to do this is to have a --use-default-template flag or something like that?
I might be wrong on that assessment; I think we should make the default the more common case, though. Any input on that would be useful.
@awni, yes, it used to be an edge case where the model expected to use the default template. However, recently models seem to be moving towards using the default chat template. The HF tokenizer.apply_chat_template is showing a warning when applying the default template, so users should be aware that the default template is being used and they can manually disable it. I am happy to close this PR for now if we still feel that this is an edge case. Maybe we can wait for a while to see how the community moves, and then decide which is the most common use case.
> The HF tokenizer.apply_chat_template is showing a warning when applying the default template
Nice, that's useful to know!
> I am happy to close this PR for now if we still feel that this is an edge case. Maybe we can wait for a while to see how the community moves, and then decide which is the most common use case.
I would suggest we do one of two things:
1. Give an option to use the default if desired (right now it's not possible if the template is None).
2. Keep it this way and keep the option to shut off the default template if desired.
I really couldn't say which is a better choice. I have a slight preference to merge what you have, see how it goes, and reassess if it becomes an issue. It seems like we are providing useful output even when the default is wrong. But we could also switch to option 1 and do the same. Wdyt?
Personally, I feel that choosing option 1 may be better since it is more explicit. If you don't mind, I can update the PR to go with option 1.
Of course, sounds good to me! Thank you!
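For reference, option 1 could look roughly like this (the flag name and wiring are illustrative, not necessarily what the PR will end up with):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-default-chat-template",
    action="store_true",
    help="Apply the tokenizer's class-level default template when chat_template is unset",
)
args = parser.parse_args()

# Later, before templating (tokenizer loaded elsewhere):
# if args.use_default_chat_template and tokenizer.chat_template is None:
#     tokenizer.chat_template = tokenizer.default_chat_template
```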
> @Blaizzy @mzbac it sounds like there is still an issue here? Are you intending to send a fix?
I was unwell, but I'm back.
@mzbac's last commit (df1eb23) addresses the issue I reported.
@awni and @mzbac I would like to point out that chat_template is the new standard and default_chat_template exists only for backwards compatibility. It's recommended to override the default_chat_template by setting chat_template, and not the other way around. According to HF, default_chat_template is likely to be dropped in favor of chat_template [1].
I could be wrong, but IMO we should be patching the tokenizers on the MLX-community hub, and those of future converted models, by setting the chat_template for the ones relying on the default, instead of using the flag during inference.
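Something along these lines (a sketch only; the repo id is a placeholder, not a real model):

```python
from transformers import AutoTokenizer

# Sketch of the patching idea: copy the fallback into chat_template once and push it,
# so the template travels with the converted model like any explicitly-set template.
tok = AutoTokenizer.from_pretrained("mlx-community/some-converted-model")  # placeholder
if tok.chat_template is None:
    tok.chat_template = tok.default_chat_template  # or the model's actual prompt format
    tok.push_to_hub("mlx-community/some-converted-model")  # requires write access
```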
Here is a summary from the HF article discussing templates [1]:
Prior to chat templates, chat handling was hardcoded per model class. For backwards compatibility, these class-specific handlers are now default class-level templates. If a model lacks a chat template but its class has a default template, the default template is used instead.
However, explicitly setting chat_template is recommended, even when the class default is suitable, to clarify chat configuration and future-proof against potential default template changes or deprecation.
You can read more here:
[1] Templates for Chat Models
- https://huggingface.co/docs/transformers/en/chat_templating#advanced-how-do-chat-templates-work
- https://huggingface.co/docs/transformers/en/chat_templating#what-are-default-templates
The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models. This really depends on the community adoption of the chat_template. We may just have to wait and see.
> The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models.
Could you elaborate? I meant patching tokenizers already in the MLX-community hub only and any new mlx-lm supported model.
> This really depends on the community adoption of the chat_template. We may just have to wait and see.
Agreed!
> The challenge with patching the chat_template is that it's sometimes quite hard to get the original model patched, and since mlx-lm is compatible with HF-format models, this may introduce inconsistent behavior between models.
> Could you elaborate? I meant patching tokenizers already in the MLX-community hub only and any new mlx-lm supported model.
> This really depends on the community adoption of the chat_template. We may just have to wait and see.
> Agreed!
Yeah, mlx-lm can directly load HF models. So if the original model has not been patched and is loaded via mlx-lm expecting the default chat template, it may behave differently from the model we patched in mlx-community.
> Yeah, mlx-lm can directly load HF models. So if the original model has not been patched and is loaded via mlx-lm expecting the default chat template, it may behave differently from the model we patched in mlx-community.
I see your point!
At the moment, the default behavior for those models is to skip the templates and pass the prompt directly, right?
If so, are there any examples of models that behave strangely with the current setup?
TL;DR: We don't need this change, but if we decide to add it, we'd better add:
- A warning that default_chat_template might produce wrong results
- A condition to check if default_chat_template exists, because it might be deprecated in the future.
I ran some tests, and even non-chat models have default templates; setting this flag causes the same issues I stated before.
Starcoder2-3b
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mlx-community/starcoder2-3b-4bit")
print(tokenizer.default_chat_template)
# {% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}
```
And using the default_chat_template results in the following output ❌:
No chat template is defined for this tokenizer - using the default template for the GPTNeoXTokenizerFast class. If the default is not appropriate for your model, please set tokenizer.chat_template to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
==========
Prompt: Write a addition function in python<|endoftext|>
<fim_prefix><fim_suffix>ue_comment>username_0:
==========
Compared to passing the prompt directly without any template ✅ :
==========
Prompt: Write a addition function in python
which takes 2 numbers as input and returns the
==========
Same thing happened with gemma-2b and stable-code-3b.
Every chat/instruction model I tested has the chat_template present and the default_chat_template is identical. This holds true for most if not all chat/instruction models.
> TL;DR: We don't need this change, but if we decide to add it, we'd better add:
> - A warning that default_chat_template might produce wrong results
> - A condition to check if default_chat_template exists, because it might be deprecated in the future.
I am not sure we don't need it. Just as an example: if I want to load a model that is expected to use the default chat template, we currently cannot apply a chat template for that model, because there is no way to apply the default chat template if the model's tokenizer chat template is None.
For point 1, we now need to manually enable the flag to apply the default chat template; the warning may seem a bit verbose. As for point 2, even if huggingface decides to deprecate the method, it doesn't necessarily mean it will be removed from the base class as that could lead to breaking changes. We can address this once huggingface actually starts deprecating the method.
> I am not sure we don't need it. Just as an example: if I want to load a model that is expected to use the default chat template, we currently cannot apply a chat template for that model, because there is no way to apply the default chat template if the model's tokenizer chat template is None.
I understood that from the start. So far, I have only seen this issue with Command-R, but after talking to the team it became clear it was a mistake on their part during the release, which I fixed on our repo by patching the tokenizer chat_template.
How many times have you encountered this issue?
And could you point me to some example chat/instruct models that also have this issue?
> How many times have you encountered this issue?
> And could you point me to some example chat/instruct models that also have this issue?
I couldn't recall all the default chat template use cases, but since it exists in the transformers tokenizer, I assume there are a lot of instances of it. So we can either ignore those as edge cases or provide a simple way to apply the default chat template. I don't think we are in a position to push community adoption of the chat_template at the moment.
Chat_template is the standard now.
Check the majority of SFT chat/instruct models. All of them have it set.
Any base model fine-tuned for chat/instruction using TRL, Axolotl, LLaMA Factory, and so on has it set too.
@Blaizzy just to understand your concern a bit. This PR has changed to give another option to the user to basically force use the default template if the model doesn't have a chat template.
Are you concerned that it's a superfluous argument? It seems pretty benign otherwise, no?
Not to argue about whether chat_template should be the standard; just a quick example of a recent instruction model which doesn't use chat_template: OpenCodeInterpreter-DS-33B -> https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B/blob/main/tokenizer_config.json.