RWKV-World-HF-Tokenizer

The tokenizer is 2.5x slower than other Hugging Face tokenizers and Blink's original world tokenizer

Open cahya-wirawan opened this issue 8 months ago • 0 comments

This tokenizer is 2.5x slower than other Hugging Face tokenizers and Blink's original world tokenizer. The comparison can be reproduced with this notebook: https://colab.research.google.com/gist/cahya-wirawan/932f95ece55c838e186dc3b1c9fcbef4/rwkv-tokenizers.ipynb
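For anyone who wants to check the slowdown outside the notebook, here is a minimal timing sketch. It is not the notebook's code: `time_encode`, `slowdown`, and `placeholder_encode` are hypothetical names, and the byte-level placeholder only stands in for a real tokenizer's encode function, which you would pass in instead.

```python
import time


def time_encode(encode, texts, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` passes
    in which `encode` is called once on every text."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(candidate_encode, reference_encode, texts, repeats=5):
    """Return candidate time divided by reference time (2.5 means 2.5x slower)."""
    candidate_time = time_encode(candidate_encode, texts, repeats)
    reference_time = time_encode(reference_encode, texts, repeats)
    # Guard against a zero reading on very small inputs.
    return candidate_time / max(reference_time, 1e-12)


def placeholder_encode(text):
    """Placeholder only (assumption): byte values, not a real RWKV tokenizer."""
    return list(text.encode("utf-8"))


sample_texts = ["Hello world, this is a tokenizer timing test."] * 1000
ratio = slowdown(placeholder_encode, placeholder_encode, sample_texts)
print(f"slowdown: {ratio:.2f}x")
```

Taking the best of several passes reduces noise from other processes. To compare real tokenizers, pass each one's encode callable in place of `placeholder_encode` and use the same texts for both.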

It also generates different token IDs for the following edge cases:

  • space at the beginning of the text: Blink's tokenizer gives " Hello" = [36786], this tokenizer gives " Hello" = [33155]
  • space at the end of the text: Blink's tokenizer gives "Hello " = [33155, 33], this tokenizer gives "Hello " = [33155]
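The two edge cases can be turned into a small regression check. The sketch below uses hypothetical names (`find_mismatches`, `REPORTED_REFERENCE`, `REPORTED_CANDIDATE`) and, so that it runs without downloading any model, replaces both tokenizers with lookup tables holding the token IDs reported in this issue. With the real tokenizers you would pass their encode callables instead.

```python
def find_mismatches(reference_encode, candidate_encode, texts):
    """Return (text, reference_ids, candidate_ids) for every text on which
    the two encode callables disagree."""
    mismatches = []
    for text in texts:
        reference_ids = reference_encode(text)
        candidate_ids = candidate_encode(text)
        if reference_ids != candidate_ids:
            mismatches.append((text, reference_ids, candidate_ids))
    return mismatches


# Stand-ins (assumption): tables of the IDs reported in this issue,
# not calls into either tokenizer.
REPORTED_REFERENCE = {" Hello": [36786], "Hello ": [33155, 33]}
REPORTED_CANDIDATE = {" Hello": [33155], "Hello ": [33155]}

edge_cases = list(REPORTED_REFERENCE)
for text, reference_ids, candidate_ids in find_mismatches(
    REPORTED_REFERENCE.__getitem__,
    REPORTED_CANDIDATE.__getitem__,
    edge_cases,
):
    print(f"{text!r}: reference={reference_ids} candidate={candidate_ids}")
```

Once the tokenizers agree on these inputs, `find_mismatches` returns an empty list, so the same function can be used as a test after a fix.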

cahya-wirawan, May 25 '24 11:05