# Add support for `tokenizers` tokenizers
### 🚀 The feature, motivation and pitch
The request is to extend the tokenizer module in torchchat to support tokenizers built with the Hugging Face `tokenizers` library. There are many models whose tokenizers won't be able to run in torchchat until they can be loaded and run either via the `tokenizers` library directly or via a conversion to `tiktoken` or `sentencepiece`.
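For illustration, here is a minimal sketch of the direct-wrapping approach. The class and method names are placeholders, not torchchat's actual tokenizer interface; a real integration would need to match whatever interface torchchat's tokenizer module expects.

```python
from tokenizers import Tokenizer  # Hugging Face `tokenizers` library


class HFTokenizer:
    """Minimal wrapper exposing encode/decode over a tokenizer.json file.

    Illustrative placeholder names; the real integration would conform to
    torchchat's tokenizer interface.
    """

    def __init__(self, tokenizer_path: str):
        # tokenizer.json bundles the vocab, merges, and pre-tokenizer config
        self._tok = Tokenizer.from_file(tokenizer_path)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)


# Example usage (path is a placeholder):
# tok = HFTokenizer("model_dir/tokenizer.json")
# print(tok.decode(tok.encode("hello world")))
```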
### Alternatives
It may be possible to convert a `tokenizers` tokenizer to a `tiktoken` tokenizer. I have a working implementation of this for the llama `tokenizer.json` model; however, other models that use different `tokenizers` configurations do not work (in particular Granite Code). A rough sketch of the conversion idea is shown below.
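The sketch rebuilds a `tiktoken.Encoding` from the byte-level BPE vocab in a `tokenizer.json`. It assumes GPT-2-style byte-level tokens, assumes vocab ids follow BPE merge order, and hard-codes the GPT-2 split regex as a stand-in; the real pre-tokenizer configuration would have to be derived from the file's `pre_tokenizer` section, which is exactly where models like Granite Code diverge.

```python
import json

import tiktoken


def bytes_to_unicode() -> dict[int, str]:
    """GPT-2's reversible byte <-> printable-unicode mapping."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def convert_to_tiktoken(tokenizer_json: str) -> tiktoken.Encoding:
    with open(tokenizer_json, encoding="utf-8") as f:
        data = json.load(f)

    # Map each vocab token (printable-unicode form) back to raw bytes.
    u2b = {u: b for b, u in bytes_to_unicode().items()}
    special = {t["content"]: t["id"] for t in data.get("added_tokens", [])}
    mergeable_ranks = {
        bytes(u2b[ch] for ch in token): rank
        for token, rank in data["model"]["vocab"].items()
        if token not in special  # tiktoken requires specials be disjoint
    }

    return tiktoken.Encoding(
        name="converted",
        # GPT-2 split regex as a stand-in; the actual pre-tokenizer must
        # be derived from data["pre_tokenizer"], which varies per model.
        pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        mergeable_ranks=mergeable_ranks,
        special_tokens=special,
    )
```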
### Additional context
This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.
I have a less fully fleshed-out working version of this that I plan to put up as a Draft PR for discussion. I am not intimately familiar with the algorithmic differences between `tiktoken` and the various `tokenizers` pieces (in particular the pre-tokenizers). My branch has a Python implementation that simply wraps `tokenizers`, but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon! The snippet below shows where the pre-tokenizer configuration lives in `tokenizer.json`.
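For reference, the pre-tokenizer configuration can be inspected directly from the `tokenizer.json` file (the path here is a placeholder); this is the section where models such as Llama and Granite Code differ and where any `tiktoken` conversion or C++ runtime would need matching logic:

```python
import json

with open("model_dir/tokenizer.json", encoding="utf-8") as f:  # placeholder path
    cfg = json.load(f)

# The "pre_tokenizer" section (often a Sequence of Split/ByteLevel steps)
# is where per-model differences show up.
print(json.dumps(cfg.get("pre_tokenizer"), indent=2))
```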
### RFC (Optional)
No response