RWKV-World-HF-Tokenizer
The tokenizer is 2.5x slower than other Hugging Face tokenizers and Blink's original world tokenizer
This tokenizer is 2.5x slower than other Hugging Face tokenizers and Blink's original world tokenizer. The comparison can be reproduced in this notebook: https://colab.research.google.com/gist/cahya-wirawan/932f95ece55c838e186dc3b1c9fcbef4/rwkv-tokenizers.ipynb
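For anyone who wants to check the slowdown outside the notebook, here is a rough sketch of how two encoders could be timed against each other. It is not the code from the notebook. The names `time_encoder` and `slowdown` are hypothetical helpers, and each tokenizer is assumed to be wrapped in a plain callable that takes a string and returns a list of token ids.

```python
import time
from typing import Callable, Iterable, List

# Assumption: each tokenizer is wrapped as a function str -> list of token ids.
Encoder = Callable[[str], List[int]]


def time_encoder(encode: Encoder, texts: Iterable[str], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    samples = list(texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in samples:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(candidate: Encoder, baseline: Encoder, texts: Iterable[str],
             repeats: int = 3) -> float:
    """Return how many times slower `candidate` is than `baseline`."""
    samples = list(texts)
    baseline_time = time_encoder(baseline, samples, repeats)
    candidate_time = time_encoder(candidate, samples, repeats)
    if baseline_time == 0.0:
        # The timer could not resolve the baseline run, so no ratio is defined.
        return float("inf")
    return candidate_time / baseline_time
```

Taking the best of several passes reduces noise from warm-up and other processes. A value around 2.5 from `slowdown` would match the number reported here, but the result depends on the texts and the machine.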
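It also generates different token ids in the following edge cases:
- Space at the beginning of the text: Blink's tokenizer gives " Hello" = [36786], this tokenizer gives " Hello" = [33155].
- Space at the end of the text: Blink's tokenizer gives "Hello " = [33155, 33], this tokenizer gives "Hello " = [33155].
In both cases this tokenizer returns the same ids as for "Hello" without the space, so it appears to drop leading and trailing spaces.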
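The edge cases above could be checked systematically with something like the following sketch. `find_mismatches` and `whitespace_variants` are hypothetical helper names, and both tokenizers are again assumed to be wrapped as plain functions from a string to a list of token ids.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: each tokenizer is wrapped as a function str -> list of token ids.
Encoder = Callable[[str], List[int]]


def whitespace_variants(word: str) -> List[str]:
    """Return the word as-is, with a leading space, a trailing space, and both."""
    return [word, " " + word, word + " ", " " + word + " "]


def find_mismatches(reference: Encoder, candidate: Encoder,
                    texts: Sequence[str]) -> Dict[str, Tuple[List[int], List[int]]]:
    """Map each text whose ids differ to a (reference ids, candidate ids) pair."""
    mismatches: Dict[str, Tuple[List[int], List[int]]] = {}
    for text in texts:
        expected = reference(text)
        actual = candidate(text)
        if expected != actual:
            mismatches[text] = (expected, actual)
    return mismatches
```

Calling `find_mismatches(blink_encode, hf_encode, whitespace_variants("Hello"))`, where `blink_encode` and `hf_encode` are placeholders for the two wrapped tokenizers, should list the two cases reported here if the behavior is as described.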