Mode-aware chat templates for distinct training and inference behaviors

Tostino opened this issue 1 year ago • 3 comments

Feature request

Implement mode-aware chat templates for distinct training and inference behaviors

Proposed Solution

To resolve the issue described in the Motivation section below, I propose adding a new variable, template_mode, to indicate whether the template is being used for training or inference. The template could then behave differently depending on the mode, producing the appropriate output in both scenarios.

Implementation Details

  1. Add a new template_mode variable to the chat template.
  2. Adjust the template logic to handle token addition differently based on the mode (see the sketch after this list):
    • In training mode: Add EOT tokens as currently implemented.
    • In inference mode: Allow completion of the last response without adding EOT tokens.
  3. This approach maintains a single template, reducing the risk of inconsistencies between training and inference.
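
A minimal sketch of what this could look like, assuming extra keyword arguments to apply_chat_template are forwarded to the template as Jinja variables; the ChatML-style template string and the template_mode variable below are hypothetical, not an existing API:

```python
from transformers import AutoTokenizer

# Hypothetical ChatML-style template: every turn is closed with <|im_end|>,
# except the final turn when rendering in "inference" mode with
# add_generation_prompt=False, which is left open so the model can complete it.
# template_mode is the proposed variable, not something existing templates know about.
MODE_AWARE_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] }}"
    "{% if not (loop.last and template_mode == 'inference' and not add_generation_prompt) %}"
    "{{ '<|im_end|>\n' }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

# Any tokenizer works here, since the template is passed explicitly and we only
# render text (tokenize=False); gpt2 is just a small, ungated placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

messages = [
    {"role": "user", "content": "Write a haiku about templates."},
    {"role": "assistant", "content": "Curly braces bloom"},  # partial response to complete
]

inference_text = tokenizer.apply_chat_template(
    messages,
    chat_template=MODE_AWARE_TEMPLATE,
    tokenize=False,
    add_generation_prompt=False,
    template_mode="inference",  # the last assistant turn is left open for completion
)
training_text = tokenizer.apply_chat_template(
    messages,
    chat_template=MODE_AWARE_TEMPLATE,
    tokenize=False,
    add_generation_prompt=False,
    template_mode="training",   # every turn is terminated with <|im_end|>
)
```

This would keep the method signature unchanged (template_mode rides along as an extra template variable) while letting a single template cover both behaviors.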

Open Questions

  1. Should the default template_mode be set to "inference" or "training"? Which would cause the least disruption?
  2. Are there any potential issues or edge cases I should consider with this mode-aware approach?
  3. Would you rather go a different direction with this?

Motivation

Background

While working on improving chat template support across the stack (training, inference servers, and UIs), I encountered an issue with the Llama 3.1 chat template. The current implementation doesn't allow for completing the last response in the list when add_generation_prompt=false, which affects the "complete response" feature in frontends.
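
For illustration, this is roughly how the limitation shows up with the current template (the repo is gated behind Meta's license, and the exact rendered string depends on the template version shipped with the tokenizer):

```python
from transformers import AutoTokenizer

# Gated repo: requires accepting Meta's license. Shown only to illustrate the issue.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "Write a limerick about GPUs."},
    {"role": "assistant", "content": "There once was a"},  # partial response we want the model to finish
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# The template closes the partial assistant turn with <|eot_id|>, roughly:
#   ...<|start_header_id|>assistant<|end_header_id|>\n\nThere once was a<|eot_id|>
# so generating from `text` starts a new turn instead of completing the partial one.
```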

Issue

I opened a PR to modify the Llama 3.1 chat template to address this issue by not adding the EOT token if add_generation_prompt=false. However, it was pointed out that this change would cause problems during training. (Reference: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct/discussions/26)

Your contribution

Willing to submit a PR with the change after it's decided which way to go.

Tostino • Aug 23 '24 13:08

fyi @Rocketknight1

ArthurZucker • Aug 27 '24 12:08

Hi @Tostino! We've had other requests for allowing chat templates to prefill assistant responses like this, but I'm still not sure how to implement it. My initial idea was actually the opposite of yours: add_generation_prompt is generally only used at inference time, which means it already acts as a kind of inference_mode flag. One possibility would be that if the final message in the chat is an assistant message and add_generation_prompt=True, then we wouldn't add an EOT token to that final message.

I don't think we have any models right now that have multiple assistant messages in a row, so I think this would work. However, it's a little ambiguous.
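
A rough sketch of what that condition could look like inside a Llama-style template (illustrative only, not the actual Meta template; it also assumes the generation prompt itself would be skipped in that case, since the open assistant turn already plays that role):

```python
# Hypothetical Llama-style template: skip the end-of-turn token on the final
# message when add_generation_prompt is true and that message is an assistant
# turn being prefilled, and skip the extra generation prompt in the same case.
PREFILL_VIA_GENERATION_PROMPT = (
    "{% for message in messages %}"
    "{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] }}"
    "{% if not (loop.last and add_generation_prompt and message['role'] == 'assistant') %}"
    "{{ '<|eot_id|>' }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    "{% endif %}"
)
```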

The other option is to add a template_mode argument like you said, and the third option is just to let people manually strip EOT tokens if they want to prefill responses, which is definitely less convenient, but both of the first two options require edits to a lot of existing templates!

This is something we might want to get some community feedback on before we commit to a plan. WDYT about the add_generation_prompt solution?

Rocketknight1 • Aug 27 '24 13:08

Hey there @Rocketknight1.

The one issue I can see with the "add_generation_prompt when the assistant is the last turn" solution is that it limits the roles this feature would be useful for.

E.g. when I added chat template support to vLLM, I had to account for the fact that not all models are trained with a fixed "assistant" role, so I added a configuration option so that when add_generation_prompt=true, we know which role the template actually appended to the list of messages: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_chat.py#L204

I can think of another example: an LLM trained on both user and assistant turns, where the user turn is the last message in the list and add_generation_prompt=false, used as an autocomplete feature for the user's input in UIs (through the same chat/completions endpoint). I am trying to ensure flexibility for the chat template feature so we aren't artificially limiting what it can be used for by our lack of imagination.
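
As a concrete example of the autocomplete case, that request could simply be a message list ending in a partial user turn (hypothetical messages, relying on the inference-mode behaviour sketched in the issue description above):

```python
# Hypothetical "autocomplete the user's input" request: the last turn is a
# partial user message that should be left unterminated by the template.
messages = [
    {"role": "assistant", "content": "How can I help you today?"},
    {"role": "user", "content": "Can you explain how chat tem"},  # partial user input
]
# Rendered with add_generation_prompt=False and template_mode="inference", a
# mode-aware template would leave this user turn open so the model can finish
# typing it, through the same chat/completions endpoint.
```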

The third option is least preferred IMO. There have been so many issues getting people to be consistent with formatting as things are; this would be just another footgun to get wrong (and mess up expensive training runs, etc.).

Definitely open to feedback though, that's why I opened an issue before just writing the PR. I know making the right decision here is going to be way harder than the code change.

Edit: One thing in favor of template_mode: if it is added to the apply_chat_template method but not supported in the templates themselves, nothing at all changes from how things work today. Templates can add support gradually, as needed, without breaking anything that currently works.
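
And when a template does opt in, it could guard on the variable so callers that never pass template_mode keep today's behaviour; a sketch of such a fragment (meant to sit inside a template's message loop, not existing code):

```python
# Hypothetical fragment for inside a template's message loop: only leave the
# final turn open when the caller explicitly asked for inference mode.
OPT_IN_GUARD = (
    "{% if template_mode is defined and template_mode == 'inference' and loop.last %}"
    "{# leave the final turn open for completion #}"
    "{% else %}"
    "{{ '<|eot_id|>' }}"
    "{% endif %}"
)
```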

Tostino • Aug 27 '24 14:08

@Tostino I opened a PR that allows assistant prefill at #33198 without changing templating behaviour - let me know what you think, and feel free to give feedback there!

Rocketknight1 • Aug 29 '24 15:08