[RWKV] Add RWKV5 model and RWKVWorldTokenizer
Add the RWKVWorldTokenizer for the RWKV5 series of models.
The tokenizer has been used in:
- RWKV/rwkv-5-world-1b5
- RWKV/rwkv-5-world-3b
- RWKV/rwkv-4-world-169m
- RWKV/rwkv-4-world-430m
- RWKV/rwkv-4-world-1b5
- RWKV/rwkv-4-world-3b
- RWKV/rwkv-4-world-7b
It is also exercised by the lambda test in https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/check_lambda/lambda_hf.py
@xianbaoqian
Hey feel free to ping me when this is ready! 🤗
Hi, the PR is ready now 🤗. @ArthurZucker
Ok! Thanks, I'll review now, but will let @amyeroberts handle the rest as I'll be off for a week 😉
Thanks for the PR! Could you explain the motivation behind not using a fast tokenizer, and whether this tokenizer could reuse the slow implementation of GPT2, for example?
Mostly, this would need a new folder since it's a new model! If we use the GPT2Tokenizer implementation, then we can just add a .md file (like we did for FLAN-T5, for example).
The model implementation is the same; only the tokenizer has different options. The tokenizer implemented in this PR is for the RWKV4 World model, but the RWKV4 World model itself is exactly the same as the existing RWKV model implementation.
@ArthurZucker Hello, I have implemented the RWKV5 model and the RWKVWorldTokenizer it requires. Please review again. Thank you.
@BBuf, thanks for helping push this!
I'm from the RWKV team, so I can help explain this part.
The main motivation for the world tokenizer is to improve support for multilingual datasets across the RWKV generations of models, especially for character-based languages or languages without "spaces". The benefit extends to European and Nordic languages as well.
Okay, understood! So this new model uses a word-level tokenizer, which can be supported both in transformers (by adding a new tokenizer, with a simple vocab / the code you are proposing) and in tokenizers, which natively has a WordLevel tokenizer! I'm thrilled to help you get this merged 😉
Hello, could you take another look at this PR? The last few commits added support for batch inference, and I feel it's getting close to being merged.
Sure I’ll review today! 🤗
I would not call it "word level"; it's more of a "trie tokenizer": spaces are simply another character with no special meaning, if that makes sense.
But yes, in concept this tokenizer could be used for non-RWKV architectures, and there is nothing stopping anyone from using our older GPT-NeoX tokenizer on our newer architecture.
Do let me know if I can clarify anything else from our end, or help in this merge =)
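To make that concrete, here is a minimal sketch of the greedy longest-match behaviour described above (the vocab and the dict lookup are illustrative stand-ins for the real vocab file and trie):
vocab = {"Hey": 1, " how": 2, " ": 3, "h": 4, "e": 5, "y": 6, "o": 7, "w": 8, "H": 9}

def trie_encode(text: str) -> list:
    ids, i = [], 0
    while i < len(text):
        # Greedy longest match: try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i]!r}")
    return ids

print(trie_encode("Hey how"))  # [1, 2] - the space is just part of ' how', not a separator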
Okay! I'll let you know, sorry I got caught up in sprints here and there but will review this early next week 🤗
All the progress look good! Ping me whenever for another review! 🤗
Now that the rwkv5 pretrained model is out, will this get merged?
I'll review again and help merge it asap!
This yields the following:
>>> from transformers import Rwkv5Tokenizer
>>> tokenizer = Rwkv5Tokenizer("/Users/arthurzucker/Work/transformers/rwkv.txt")
>>> prompt = "Hey how are you? 男:听说你们公司要派你去南方工作"
>>> ids = tokenizer.encode(prompt)
>>> print(ids)
[0, 6037, 21887, 21338, 22851, 64, 65517, 14631, 19181, 11095, 16765, 10494, 10432, 10708, 11059, 16533, 13848, 10494, 11015, 10964, 13066, 12167, 10490]
>>> print(tokenizer.tokenize(prompt))
['Hey', ' how', ' are', ' you', '?', ' ', '男', ':', '听', '说', '你', '们', '公', '司', '要', '派', '你', '去', '南', '方', '工', '作']
>>> print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
['<s>', 'Hey', ' how', ' are', ' you', '?', ' ', '男', ':', '听', '说', '你', '们', '公', '司', '要', '派', '你', '去', '南', '方', '工', '作']
>>> print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))))
<s>Hey how are you? 男:听说你们公司要派你去南方工作
>>> print(tokenizer.decode(tokenizer.encode(prompt)))
<s>Hey how are you? 男:听说你们公司要派你去南方工作
Thank you for your advice. The main problem here is that the original tokenizer implementation (https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer) does not have a bos, eos, or pad token, yet bos_token_id, eos_token_id, and pad_token_id are all set to 0. In my Hugging Face implementation I simulated this situation. But now I am unsure what to set for the bos, eos, and pad tokens, as it seems that setting any token would not meet expectations, so this tokenizer feels like a special hack case. I would like to ask if it is acceptable not to merge the tokenizer's definition into the Transformers repository and to only merge the implementation of the RWKV5 model. This custom tokenizer implementation could then be placed in the corresponding repository on the Hugging Face Hub, for example: https://huggingface.co/RWKV/HF_v5-Eagle-7B .
In the code I provided I manually set self._added_tokens_decoder = {0: AddedToken(bos_token)}, which forces token 0 to be the bos_token. We can of course force any other behaviour that way, but we only need to define a string token. This will be very useful to the community as a whole for any SFT training, as it's the expected API.
Whether or not the original tokenizer has such a token, it has token_id=0, which means we can choose the content of the token. I used <s>, but we should use something like <|endoftext|> just to make sure it doesn't already exist. This should solve all the issues you are having: doing something like tokenizer.encode("<|endoftext|>") will yield 0, which is what we want.
WDYT?
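For concreteness, a minimal sketch of that trick (illustrative only, not the actual PR code; the chosen string is an assumption):
from transformers import AddedToken

bos_token = "<|endoftext|>"  # chosen so it cannot collide with an existing vocab entry
added_tokens_decoder = {0: AddedToken(bos_token)}

# Inside a slow tokenizer's __init__ this dict would be assigned to
# self._added_tokens_decoder, after which:
#   tokenizer.encode("<|endoftext|>")   -> [0]
#   tokenizer.convert_ids_to_tokens(0)  -> "<|endoftext|>"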
In general the model was not trained with a special bos_token or pad_token in mind (we use 0, plus masking, for padding).
So for all these tokens we typically just use token 0 as a fallback, and it "generally works", if that makes sense - so defaulting all these tokens to 0 makes sense to me (coming from the trainer / model team side).
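A tiny illustration of that convention (the ids are made up; only the pad-with-0-plus-mask pattern matters):
import numpy as np

sequences = [[6037, 21887, 21338], [6037, 64]]  # already-encoded ids
max_len = max(len(s) for s in sequences)

input_ids = np.zeros((len(sequences), max_len), dtype=np.int64)  # 0 doubles as the pad id
attention_mask = np.zeros_like(input_ids)
for i, seq in enumerate(sequences):
    input_ids[i, : len(seq)] = seq
    attention_mask[i, : len(seq)] = 1  # real tokens are unmasked; padding stays 0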
I ran into a bug with this tokenizer, sourced from here. I'm not sure how much the two code bases have diverged at this point, but @BlinkDL asked me to report it in this PR.
My issue relates to this code:
import numpy as np
from tqdm import tqdm

# Tokenize each batch of text, keeping overflowing tokens as extra rows.
tokenized_batches = []
with tqdm(total=len(batches)) as pbar:
    for batch in batches:
        tokenized = tokenizer(
            batch,
            max_length=block_size,
            stride=stride,
            padding="max_length",
            return_overflowing_tokens=True,
            truncation=True,
            return_tensors="np",
        )
        tokenized_batches.append(tokenized["input_ids"])
        pbar.update(1)
tokens = np.concatenate(tokenized_batches)
Typically, the tokenizer should pad these batches to a consistent length, but that's not happening here:
File "/usr/local/lib/python3.10/dist-packages/aigen/datasets.py", line 188, in encode_tokens
tokens = np.concatenate(tokenized_batches)
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 24226 and the array at index 1 has size 26680
The fix for me was rather simple, but I feel like it should probably be handled by the tokenizer itself:
# Pad every tokenized batch out to the longest one before concatenating.
max_len = max(batch.shape[1] for batch in tokenized_batches)

padded_batches = []
for batch in tokenized_batches:
    # Calculate the number of padding tokens needed
    padding_length = max_len - batch.shape[1]
    # Pad the batch and add it to the padded_batches list
    if padding_length > 0:
        padded_batch = np.pad(batch, ((0, 0), (0, padding_length)), mode='constant', constant_values=tokenizer.pad_token_id)
    else:
        padded_batch = batch
    padded_batches.append(padded_batch)
Just a PSA. Thanks!
Can you try with this one: https://github.com/huggingface/transformers/pull/26963#pullrequestreview-1861671950
I tried this tokenizer, but it seems that I can't get the expected results. I would like to ask whether the rwkv.txt in the code you provided is the original file from https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_vocab_v20230424.txt.
I converted the vocab to the appropriate format to read it, sorry I forgot that step. Will push it now. I used your tokenizer's encoder.
Okay, thanks.
I used this:
from transformers import AutoTokenizer
from transformers import Rwkv5Tokenizer  # available on this PR's branch

# Dump the remote tokenizer's vocab (one bytes entry per line) to a plain-text file.
tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-5-world-1b5", trust_remote_code=True)
with open("/Users/arthurzucker/Work/transformers/rwkv.txt", "wb") as f:
    for index, token in tokenizer.encoder.items():
        f.write(token + b"\n")

tokenizer = Rwkv5Tokenizer("/Users/arthurzucker/Work/transformers/rwkv.txt")
prompt = "Hey how are you? 男:听说你们公司要派你去南方工作"
ids = tokenizer.encode(prompt)
print(ids)
print(tokenizer.tokenize(prompt))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))))
print(tokenizer.decode(tokenizer.encode(prompt)))
Pushed the tokenizer here: https://huggingface.co/ArthurZ/rwkv-5
Okay, got it.
Hello, I encountered an error while testing this tokenizer, but I'm not sure how to resolve it.
ERROR: test_added_token_serializable (tests.models.rwkv5.test_tokenization_rwkv5.RWKV5TokenizationTest.test_added_token_serializable) [Rwkv5Tokenizer]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bbuf/工作目录/RWKV/transformers/tests/test_tokenization_common.py", line 2204, in test_added_token_serializable
    tokenizer.from_pretrained(tmp_dir_name)
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/tokenization_utils_base.py", line 2031, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/tokenization_utils_base.py", line 2263, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/models/rwkv5/tokenization_rwkv5.py", line 133, in __init__
    self._added_tokens_decoder = {0:AddedToken(bos_token)}
                                    ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument 'content': 'AddedToken' object cannot be converted to 'PyString'
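One plausible fix for the failure above (an assumption on my side, not necessarily the final change): from_pretrained can pass bos_token back in as an AddedToken rather than a string, so only wrap plain strings:
from transformers import AddedToken

def as_added_token(token):
    # AddedToken(AddedToken(...)) raises the PyString TypeError above,
    # so leave already-wrapped tokens alone.
    return token if isinstance(token, AddedToken) else AddedToken(token)

# in Rwkv5Tokenizer.__init__:
#   self._added_tokens_decoder = {0: as_added_token(bos_token)}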