
Community Resource: AutoTikTokenizer - A Bridge Between TikToken and HuggingFace Tokenizers

Open bhavnicksm opened this issue 1 year ago • 3 comments

Hi TikToken team! 👋

I wanted to share a community resource that might be helpful for TikToken users who also work with HuggingFace tokenizers. I've created AutoTikTokenizer, a lightweight library that allows loading any HuggingFace tokenizer as a TikToken-compatible encoder.

What it does:

  • Enables using TikToken's fast tokenization with any HuggingFace tokenizer
  • Preserves exact encoding/decoding compatibility with original tokenizers
  • Simple drop-in usage similar to HuggingFace's AutoTokenizer

Quick example:

from autotiktokenizer import AutoTikTokenizer

# Load any HF tokenizer as a TikToken encoder
encoder = AutoTikTokenizer.from_pretrained('gpt2')
tokens = encoder.encode("Hello world!")  # list of integer token ids
text = encoder.decode(tokens)            # round-trips back to the original string
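
For anyone curious about what such a bridge involves, here is a rough sketch of one step: turning the "vocab" section of a HuggingFace tokenizer.json (token string to id) into the bytes-to-rank table that TikToken's `Encoding` takes as `mergeable_ranks`. This is an illustration only, not necessarily how AutoTikTokenizer does it. It assumes a GPT-2-style byte-level BPE vocabulary, and `vocab_to_mergeable_ranks` and `toy_vocab` are hypothetical names made up for the example.

```python
def bytes_to_unicode():
    """GPT-2's reversible mapping from byte values (0-255) to printable characters."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, etc.) are shifted to code points >= 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def vocab_to_mergeable_ranks(vocab):
    """Hypothetical helper: map {token_string: id} to {token_bytes: id}.

    Assumes every character in every token comes from the GPT-2 byte-level
    alphabet; other tokenizer types (e.g. SentencePiece-based) need different handling.
    """
    unicode_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    ranks = {}
    for token, rank in vocab.items():
        ranks[bytes(unicode_to_byte[ch] for ch in token)] = rank
    return ranks


# Hypothetical toy vocabulary in the style of a tokenizer.json "vocab" section.
# "Ġ" is how the byte-level alphabet spells a leading space.
toy_vocab = {"H": 0, "i": 1, "Ġ": 2, "Hi": 3, "Ġthere": 4}
ranks = vocab_to_mergeable_ranks(toy_vocab)
```

A real conversion also has to carry over the pre-tokenization regex and the special tokens, and it relies on token ids reflecting merge priority, which is an assumption that does not hold for every tokenizer.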

The library is available on PyPI (pip install autotiktokenizer) and is fully open source at: https://github.com/bhavnicksm/autotiktokenizer

I've tested it with several popular models including GPT-2, LLaMA, Mistral, and others. I hope this helps TikToken users who want to work with a broader range of tokenizers while keeping TikToken's performance benefits!

Feel free to check it out if you think it would be useful for the community. Happy to hear any feedback or suggestions!

[Note: This is purely a community contribution - I'm not affiliated with the TikToken team]

bhavnicksm avatar Nov 07 '24 08:11 bhavnicksm

Dear Bhavnick Minhas!

For months I've been searching for any documentation describing the format of the "vocab" section of tokenizer.json, or any sane code showing how to interpret it. Your code is a perfect example. Where have you been all this time? I am so thankful to you for your work!

idruker-cerence avatar Dec 15 '24 01:12 idruker-cerence

Hey @idruker-cerence!

I'm glad to hear that~ 😊

Please let me know if you have any questions about the implementation details as well. I'm happy to clarify and share resources.

And, always open to feedback!

Thanks! ☺️

bhavnicksm avatar Dec 15 '24 10:12 bhavnicksm

Hi, is AutoTikTokenizer still maintained/alive? The repo link that Google has indexed returns a 404.

coopslarhette avatar Apr 26 '25 01:04 coopslarhette