
Add support for `tokenizers` tokenizers


🚀 The feature, motivation and pitch

The request is to extend the tokenizer module in torchchat to support tokenizers built on the Hugging Face `tokenizers` library.

Many models use `tokenizers`-based tokenizers and will not be able to run in torchchat until those tokenizers can be loaded and run, either via the `tokenizers` library directly or via a conversion to tiktoken or sentencepiece.
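
For reference, loading and running one of these tokenizers via the `tokenizers` library itself is straightforward; a minimal sketch (the `tokenizer.json` path is a placeholder for whatever file a given model ships):

```python
# Minimal sketch: load and run a Hugging Face `tokenizers` tokenizer directly.
# "tokenizer.json" is a placeholder path for any model that ships one.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("def hello_world():").ids  # list[int] of token ids
text = tok.decode(ids)                      # round-trip back to a string
print(ids, text)
```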

Alternatives

It may be possible to convert a `tokenizers` tokenizer to a tiktoken tokenizer. I have a working implementation of this for the llama `tokenizer.json` model; however, other models that use different `tokenizers` configurations do not work (in particular Granite Code).
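
For context, a conversion along those lines roughly amounts to mapping the byte-level BPE vocab in `tokenizer.json` back to raw bytes and handing tiktoken the resulting ranks. A hedged sketch of the idea follows; the split regex and special-token handling are model-specific (which is likely exactly where other configurations break), and treating vocab ids as merge ranks is an assumption that holds for GPT-2/llama-style byte-level BPE:

```python
# Hedged sketch: convert a byte-level BPE vocab from tokenizer.json into
# tiktoken "mergeable ranks". Assumes GPT-2/llama-style byte-level BPE;
# other pretokenizer configurations (e.g. Granite Code's) may not map cleanly.
import json

import tiktoken


def bytes_to_unicode():
    # GPT-2's reversible byte <-> printable-unicode mapping used by
    # byte-level BPE; returned here inverted (char -> byte value).
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for c, b in zip(cs, bs)}


with open("tokenizer.json") as f:
    vocab = json.load(f)["model"]["vocab"]

char_to_byte = bytes_to_unicode()
mergeable_ranks = {
    bytes(char_to_byte[ch] for ch in token): rank
    for token, rank in vocab.items()
}

enc = tiktoken.Encoding(
    name="converted",
    # GPT-2's split pattern; the correct pattern is model-specific and
    # comes from the pretokenizer config in tokenizer.json.
    pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
    mergeable_ranks=mergeable_ranks,
    special_tokens={},  # model-specific special tokens omitted here
)
```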

Additional context

This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.

I have a partially fleshed-out working version of this that I plan to put up as a Draft PR for discussion. I am not intimately familiar with the algorithmic differences between tiktoken and the various `tokenizers` pieces (in particular the pretokenizers). My branch has a Python implementation that simply wraps `tokenizers` (sketched below), but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon!
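
For concreteness, the wrapper approach is roughly the following. The interface shown is an assumption about torchchat's tokenizer abstraction; the actual class and method names on my branch may differ:

```python
# Hedged sketch of the Python wrapper: delegate encode/decode to the
# `tokenizers` library. The method signatures here are assumed to mirror
# torchchat's existing tokenizer interface and may not match it exactly.
from typing import List

from tokenizers import Tokenizer


class TokenizersTokenizer:  # would subclass torchchat's tokenizer base
    def __init__(self, file_path: str):
        self._tok = Tokenizer.from_file(file_path)

    def encode(self, s: str) -> List[int]:
        # BOS/EOS handling would come from the model's config in practice.
        return self._tok.encode(s, add_special_tokens=False).ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids)
```

The catch, as noted above, is export: this only works where a Python runtime is available, so exported paths would need a corresponding C++ implementation.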

RFC (Optional)

No response

gabe-l-hart · Oct 01 '24