Add `bot_token` attribute to `PreTrainedTokenizer` and `PreTrainedTokenizerFast`
### Feature request
I'm requesting that a `bot_token` (beginning-of-tools token) attribute be added to the `PreTrainedTokenizer` and `PreTrainedTokenizerFast` classes, similar to `eos_token`. The token would be exposed as `self.bot_token` and `self.bot_token_id`, making it available to downstream consumers like vLLM.
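A minimal sketch of the proposed interface, assuming it mirrors how `eos_token`/`eos_token_id` behave today. The tokenizer class below is purely illustrative (it is not the `transformers` implementation), and the `<|tool_call|>` token string is a made-up placeholder:

```python
class TokenizerWithBotToken:
    """Illustrative stand-in for a tokenizer with the proposed attribute."""

    def __init__(self, vocab, bot_token=None):
        self.vocab = vocab          # token string -> id
        self.bot_token = bot_token  # defaults to None for backwards compatibility

    @property
    def bot_token_id(self):
        # Mirrors eos_token_id: None when the token is unset or not in the vocab.
        if self.bot_token is None:
            return None
        return self.vocab.get(self.bot_token)


# Usage: models that define a tool-call token set it; others leave it None.
tok = TokenizerWithBotToken({"<|tool_call|>": 32000}, bot_token="<|tool_call|>")
```

Defaulting to `None` means existing tokenizers are unaffected, and consumers can probe for the attribute safely.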
### Motivation
This request builds on this PR comment as well as the ongoing work to support function calling in `transformers`.
A number of downstream consumers depend on what's exposed by the `PreTrainedTokenizer` classes, such as vLLM's `Sequence` and `LLMEngine` classes. For example, the problem I'm currently facing is that vLLM doesn't label the finish reason for tool-call outputs as, well, tool calls, since `CompletionOutput.finish_reason` ultimately relies on the attributes available in `PreTrainedTokenizer`.
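To make the downstream use concrete, here is a hedged sketch of how an engine could classify a finish reason once the attribute exists. This is not vLLM's actual code; the `finish_reason` helper, the `"tool_calls"` label, and the token id are all hypothetical:

```python
from types import SimpleNamespace


def finish_reason(output_token_ids, tokenizer):
    """Label an output as a tool call when the bot token appears (sketch only)."""
    # getattr with a default keeps this safe on tokenizers predating bot_token_id.
    bot_id = getattr(tokenizer, "bot_token_id", None)
    if bot_id is not None and bot_id in output_token_ids:
        return "tool_calls"
    return "stop"


# Stand-in for a tokenizer that defines the proposed attribute.
tokenizer = SimpleNamespace(bot_token_id=32000)
```

Because the attribute defaults to `None` (or is absent on older tokenizers), the check degrades gracefully to the current behavior.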
As open-source tool calling proliferates, exposing these attributes would greatly enhance the utility of the library. The token can default to `None`, so with the right implementation the change should be fully backwards compatible.
### Your contribution
I can help contribute to the PR and write code. I might need help navigating the library and writing good test cases.