Arthur
Okay, understood! So this new model uses a word-level tokenizer, which can be supported both in transformers (by adding a new tokenizer, with a simple vocab / the code...
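For context, a minimal sketch of what a word-level tokenizer with a simple vocab could look like (the class and vocab here are hypothetical illustrations, not the actual PR code):

```python
# Hypothetical word-level tokenizer sketch, NOT the PR implementation:
# each whitespace-separated word maps to a single id via a plain vocab dict.
class SimpleWordTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab  # word -> id
        self.ids_to_tokens = {i: w for w, i in vocab.items()}
        self.unk_id = vocab.get("<unk>", 0)

    def encode(self, text):
        # Unknown words fall back to the <unk> id.
        return [self.vocab.get(word, self.unk_id) for word in text.split()]

    def decode(self, ids):
        return " ".join(self.ids_to_tokens.get(i, "<unk>") for i in ids)


vocab = {"<unk>": 0, "Hey": 1, "how": 2, "are": 3, "you?": 4}
tok = SimpleWordTokenizer(vocab)
print(tok.encode("Hey how are you?"))  # [1, 2, 3, 4]
print(tok.decode([1, 2, 3, 4]))        # Hey how are you?
```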
Sure I’ll review today! 🤗
Okay! I'll let you know, sorry I got caught up in sprints here and there but will review this early next week 🤗
All the progress looks good! Ping me whenever for another review! 🤗
I'll review again and help merge it asap!
This yields the following:
```python
>>> from transformers import Rwkv5Tokenizer
>>> tokenizer = Rwkv5Tokenizer("/Users/arthurzucker/Work/transformers/rwkv.txt")
>>> prompt = "Hey how are you? 男:听说你们公司要派你去南方工作"
>>> ids = tokenizer.encode(prompt)
>>> print(ids)
[0, 6037,...
```
In the code I provided I manually set `self._added_tokens_decoder = {0:AddedToken(bos_token)}` which forces the token 0 to be the `bos_token`. We can of course force any other behaviour that way,...
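To illustrate the mechanism, here is a self-contained sketch of the idea (a simplified stand-in, not the actual transformers internals or the Rwkv5Tokenizer code): a decoder-side dict maps the reserved id 0 to an `AddedToken`-like entry, so id 0 always resolves to the bos token.

```python
# Simplified stand-in for the mechanism described above; the real
# transformers tokenizer has much more machinery around this.
class AddedToken:
    def __init__(self, content):
        self.content = content


class SketchTokenizer:
    def __init__(self, vocab, bos_token="<s>"):
        self.vocab = vocab  # word -> id; id 0 is reserved for bos
        self.bos_token = bos_token
        # Force the token 0 to be the bos token, mirroring
        # `self._added_tokens_decoder = {0: AddedToken(bos_token)}`.
        self._added_tokens_decoder = {0: AddedToken(bos_token)}

    def encode(self, text):
        # Prepend the forced bos id, then look up each known word.
        return [0] + [self.vocab[w] for w in text.split() if w in self.vocab]

    def decode(self, ids):
        out = []
        for i in ids:
            if i in self._added_tokens_decoder:
                # Reserved ids resolve through the forced decoder dict.
                out.append(self._added_tokens_decoder[i].content)
            else:
                out.append(next(w for w, j in self.vocab.items() if j == i))
        return " ".join(out)


tok = SketchTokenizer({"Hey": 1, "there": 2})
print(tok.encode("Hey there"))  # [0, 1, 2]
print(tok.decode([0, 1, 2]))    # <s> Hey there
```

Any other behaviour could be forced the same way by putting a different `AddedToken` at id 0.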
Whether or not the original tokenizer has a `token`, it has `token_id=0`, which means we can choose the content of the token. I used `` but we should use something...
Can you try with this one: https://github.com/huggingface/transformers/pull/26963#pullrequestreview-1861671950