tiktokenizer
Add other BPE models
This is a hack that converts a huggingface/tokenizers JSON configuration into a format accepted by openai/tiktoken and @dqbd/tiktoken.
This is mostly a stop-gap solution and an interesting research avenue for porting tiktoken models into huggingface/tokenizers.
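For readers who want a feel for this kind of conversion, here is a minimal sketch. It is not the code from this PR. It assumes a byte-level BPE `tokenizer.json` (GPT-2 style) whose `model.vocab` ids can be reused directly as tiktoken ranks, which holds only when ids follow merge order. The function names `hf_vocab_to_ranks` and `to_tiktoken_lines` are hypothetical, and the output is the plain `.tiktoken` line format (base64 token, space, rank) rather than any particular JSON layout.

```python
import base64
import json


def bytes_to_unicode():
    """GPT-2 table mapping each byte (0-255) to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def hf_vocab_to_ranks(tokenizer_json):
    """Return (ranks, special_tokens) from a parsed tokenizer.json dict.

    Assumption: vocab ids double as merge ranks. Tokens containing
    characters outside the byte table raise KeyError.
    """
    unicode_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    special = {
        t["content"]: t["id"]
        for t in tokenizer_json.get("added_tokens", [])
        if t.get("special")
    }
    ranks = {}
    for token, token_id in tokenizer_json["model"]["vocab"].items():
        if token in special:
            continue  # tiktoken keeps special tokens in a separate map
        ranks[bytes(unicode_to_byte[ch] for ch in token)] = token_id
    return ranks, special


def to_tiktoken_lines(ranks):
    """Serialize ranks as '<base64 token> <rank>' lines, ordered by rank."""
    ordered = sorted(ranks.items(), key=lambda kv: kv[1])
    return "\n".join(
        f"{base64.b64encode(tok).decode()} {rank}" for tok, rank in ordered
    )


if __name__ == "__main__":
    # Tiny inline stand-in for a real tokenizer.json file.
    demo = json.loads('{"model": {"vocab": {"a": 0, "b": 1, "ab": 2}}}')
    demo_ranks, _ = hf_vocab_to_ranks(demo)
    print(to_tiktoken_lines(demo_ranks))
```

A real conversion would also need the pre-tokenization regex (tiktoken's `pat_str`) and a check that every merged token can actually be reached from its parts under the chosen ranks, which is what the unchecked task below refers to.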
- [ ] Revisit invalid checks when merging ranks
Vercel deployment status:
Name | Status | Updated (UTC) |
---|---|---|
tiktokenizer | ✅ Ready | Apr 12, 2023 4:13pm |
However, the best solution would be to use unstable_wasm of huggingface/tokenizers to support SentencePiece, Unigram and other tokenizers.
@dqbd Thank you for sharing this interesting work! I would also like to compare our own tokenizer using this project, but I'm stuck on converting a huggingface tokenizer.json file into a tiktoken JSON file like claude.json or gptj.json. Could you share the code for this? :)