Tokenizers tokenizer
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
- [x] Safetensors: #1255
- [x] Bias tensors: #1259
- [x] Tied word embeddings: #1260
Issues
Closes #1251
Description
This PR adds partial support for models that use the tokenizers library (as opposed to tiktoken or sentencepiece) for tokenization. It only addresses support in the python runner, which it does by creating a new class in the tokenizer module that simply wraps tokenizers.
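For reference, here is a minimal sketch of what such a wrapper could look like. The tokenizers calls (`Tokenizer.from_file`, `encode`, `decode`, `token_to_id`) are the library's real API, but the class shape, method names, and BOS/EOS token strings are illustrative assumptions, not torchchat's actual interface:

```python
# Minimal sketch of wrapping HF `tokenizers`; the interface shown
# (encode/decode/bos_id/eos_id) is assumed, not torchchat's real base class.
from typing import List

from tokenizers import Tokenizer as HFTokenizer


class TokenizersTokenizer:
    def __init__(self, tokenizer_path: str):
        # `tokenizer_path` points at a HF-style tokenizer.json
        self._tok = HFTokenizer.from_file(tokenizer_path)

    def encode(self, text: str, bos: bool = False, eos: bool = False) -> List[int]:
        ids = self._tok.encode(text, add_special_tokens=False).ids
        if bos:
            ids = [self.bos_id()] + ids
        if eos:
            ids = ids + [self.eos_id()]
        return ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids)

    def bos_id(self) -> int:
        # "<s>"/"</s>" are placeholders; real models name these differently
        return self._tok.token_to_id("<s>")

    def eos_id(self) -> int:
        return self._tok.token_to_id("</s>")
```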
Discussion
I'm not sure this is the correct direction to go for solving this, since the tokenizers library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending tokenizer support beyond simply wrapping tokenizers:
Pre-tokenizers
For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me whether this kind of multi-pass splitting can be merged into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in tiktoken.
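To make the multi-pass splitting concrete, here is a toy illustration (not torchchat or tokenizers code, and the patterns are invented) of how each regex pass re-splits the pieces produced by the previous one:

```python
# Toy multi-pass pre-tokenization: each pattern splits the output of the
# previous pass, which is what makes collapsing into one regex non-obvious.
import re

PASSES = [
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)",  # pass 1: split off contractions
    r"\s+",                           # pass 2: split on whitespace runs
]

def pre_tokenize(text: str) -> list[str]:
    pieces = [text]
    for pattern in PASSES:
        next_pieces = []
        for piece in pieces:
            # the capture group keeps delimiters; drop empty strings
            next_pieces.extend(p for p in re.split(f"({pattern})", piece) if p)
        pieces = next_pieces
    return pieces

print(pre_tokenize("I'll take    two"))
# ['I', "'ll", ' ', 'take', '    ', 'two']
```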
From my investigation, I think there are a few candidate paths forward:
- Provide a c++ implementation of the various tokenization routines from tokenizers in a separate implementation of the Tokenizer class.
- Extend the existing c++ TikToken class to support multiple regexes in the pre-tokenizer.
  - This would also mean making the set of patterns configurable, and therefore either serializing them into the tokenizer.model artifact or passing them as arguments at instantiation time (a rough sketch of such a serialized config is given after this list).
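As a rough sketch of what that serialization could look like, the patterns might be carried in a small config next to the vocab; every field name here is invented for illustration and is not an existing torchchat format:

```python
# Hypothetical config carrying pre-tokenizer patterns alongside the vocab,
# so a C++ loader could reconstruct the splitting passes at load time.
import json

tokenizer_config = {
    "version": 1,
    "pre_tokenizer_patterns": [  # applied in order, as multiple passes
        r"(?i:'s|'t|'re|'ve|'m|'ll|'d)",
        r"\s+",
    ],
    "special_tokens": {"<s>": 0, "</s>": 1},
}

with open("tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f, indent=2)
```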
NOTE: The corresponding tokenization in llama.cpp lives here. That code is a full implementation of a unified tokenizer, with configuration to dispatch between known patterns and optimized implementations. The config value indicating which tokenizer a model uses is stored directly in the model's GGUF file, so at load time the correct tokenizer is found based on that value.
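A toy version of that config-driven dispatch, with entirely hypothetical class and key names, might look like:

```python
# Toy config-driven tokenizer dispatch in the spirit of llama.cpp's
# approach; the registry keys and stub classes are hypothetical.
from typing import Dict, Type


class StubTokenizer:
    """Stands in for a real Tokenizer implementation."""
    def __init__(self, path: str):
        self.path = path


class TikTokenStub(StubTokenizer): ...
class SentencePieceStub(StubTokenizer): ...
class TokenizersStub(StubTokenizer): ...


REGISTRY: Dict[str, Type[StubTokenizer]] = {
    "tiktoken": TikTokenStub,
    "sentencepiece": SentencePieceStub,
    "tokenizers": TokenizersStub,
}


def load_tokenizer(config: dict, path: str) -> StubTokenizer:
    # `tokenizer_type` plays the role of the field llama.cpp reads
    # from the GGUF file to pick the right implementation
    return REGISTRY[config["tokenizer_type"]](path)
```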
Special Tokens
Even for models that use a single regex (and even the llama regex), models may use different special tokens for special functionality (chat template, FIM, tool calling, other custom prompting). Since only the vocab is stored in tokenizer.model, there is currently no way to record the special tokens in serialization (similar to the need for configuration of pre-tokenizers).
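For contrast, the tokenizers JSON format does record this information. Here is a short sketch of pulling special tokens out of a tokenizer.json; the added_tokens layout shown is the real one used by tokenizers, though reading the file by hand like this is just for illustration:

```python
# Recover special tokens from a HF tokenizer.json, which (unlike a
# vocab-only tokenizer.model) marks them explicitly in "added_tokens".
import json

with open("tokenizer.json") as f:
    config = json.load(f)

special_tokens = {
    entry["content"]: entry["id"]
    for entry in config.get("added_tokens", [])
    if entry.get("special", False)
}
print(special_tokens)  # e.g. {"<|endoftext|>": 0, "<fim_prefix>": 1, ...}
```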