
Results 4 RWKV-World-HF-Tokenizer issues

I need to use **tokenizer.json** in my project. How should I create it?

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer in tokenization_rwkv5.py uses whitespace_tokenize to split tokens, which seems unfriendly to Chinese text, since Chinese is not written with spaces between words.
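To illustrate the concern about whitespace splitting, here is a minimal sketch. It assumes whitespace_tokenize behaves like the helper of the same name in the BERT tokenizer (strip, then split on whitespace); the function below is a stand-in written for this example, not code copied from tokenization_rwkv5.py, and the sample sentences are made up.

```python
def whitespace_tokenize(text):
    """Stand-in (assumed behavior): strip the text and split on whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


english = "RWKV is a recurrent model"
chinese = "今天天气很好"  # no spaces between words

# English is split into one piece per word.
print(whitespace_tokenize(english))

# The Chinese sentence contains no whitespace, so it comes back as one piece.
print(whitespace_tokenize(chinese))
```

If the tokenizer really pre-splits this way, any later matching step would have to handle an entire Chinese sentence as a single chunk, which is the behavior the question is asking about.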

This tokenizer is 2.5x slower than other Hugging Face tokenizers and the original BlinkDL world tokenizer. The comparison can be tested here: https://colab.research.google.com/gist/cahya-wirawan/932f95ece55c838e186dc3b1c9fcbef4/rwkv-tokenizers.ipynb. It also generates different token IDs for the following...
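For anyone who wants to reproduce this kind of comparison outside the notebook, a rough sketch of a harness follows. It is not taken from the linked Colab: `benchmark`, `first_mismatch`, and the two toy encoders are hypothetical names introduced here, and the toy encoders only stand in for whatever `encode` functions the real tokenizers expose.

```python
import time


def benchmark(encode, texts, repeats=3):
    """Return the best wall-clock time (seconds) to encode all texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def first_mismatch(encode_a, encode_b, texts):
    """Return (text, ids_a, ids_b) for the first text whose ids differ, else None."""
    for text in texts:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            return text, ids_a, ids_b
    return None


# Toy stand-ins (assumptions): replace with the real tokenizers' encode functions.
def byte_encode(text):
    return list(text.encode("utf-8"))


def codepoint_encode(text):
    return [ord(ch) for ch in text]


samples = ["hello world", "é"]
print("byte_encode seconds:", benchmark(byte_encode, samples))
print("codepoint_encode seconds:", benchmark(codepoint_encode, samples))
print("first mismatch:", first_mismatch(byte_encode, codepoint_encode, samples))
```

Taking the best of several runs reduces noise from warm-up and other processes; reporting the first mismatching input makes an ID difference easy to quote in an issue.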