[RWKV] Add RWKV5 model and RWKVWorldTokenizer

BBuf opened this issue 1 year ago • 32 comments

Add RWKVWorldTokenizer for rwkv5 series model.

The tokenizer has been used in:

and in the lambda test at https://github.com/BBuf/RWKV-World-HF-Tokenizer/blob/main/check_lambda/lambda_hf.py

@xianbaoqian

BBuf avatar Oct 20 '23 15:10 BBuf

Hey feel free to ping me when this is ready! 🤗

ArthurZucker avatar Oct 20 '23 16:10 ArthurZucker

Hi, pr ready now 🤗. @ArthurZucker

BBuf avatar Oct 24 '23 09:10 BBuf

Ok! Thanks, I'll review now, but will let @amyeroberts handle the rest as I'll be off for a week 😉

ArthurZucker avatar Oct 27 '23 14:10 ArthurZucker

> Thanks for the PR! Could you explain the motivation behind not using the fast tokenizer, and whether this tokenizer differs from the slow implementation of GPT2, for example?
>
> Mostly, this will need a new folder as it's a new model! If we use the GPT2Tokenizer implementation, then we can also just add a .md file (like we did for Flan-T5, for example)

The model implementation is the same; only the tokenizer has different options. The tokenizer implemented in this PR is for the RWKV4 World model, but the RWKV4 World model implementation is exactly the same as the existing RWKV model implementation.

BBuf avatar Oct 29 '23 02:10 BBuf

@ArthurZucker Hello, I have implemented the RWKV5 model and the RWKVWorldTokenizer it requires. Please review again. Thank you.

BBuf avatar Nov 02 '23 02:11 BBuf

> Thanks for the PR! Could you explain the motivation behind not using the fast tokenizer, and whether this tokenizer differs from the slow implementation of GPT2, for example?
>
> Mostly, this will need a new folder as it's a new model! If we use the GPT2Tokenizer implementation, then we can also just add a .md file (like we did for Flan-T5, for example)

@BBuf, thanks for helping push this!

I'm from the RWKV team, so I can help explain this part.

The main motivation for the world tokenizer is to improve support for multilingual datasets within the RWKV generations of models, especially for character-based languages or languages without "spaces". The benefit also applies to European or Nordic languages.

PicoCreator avatar Dec 03 '23 00:12 PicoCreator

Okay, understood! So this new model uses a word-level tokenizer, which can be supported both in transformers (by adding a new tokenizer with a simple vocab, i.e. the code you are proposing) and in tokenizers, which natively has a WordLevel model!
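
For reference, here is a minimal sketch of what a native WordLevel model in the tokenizers library looks like (toy vocab purely for illustration; as discussed further down, the RWKV tokenizer turned out not to be word-level):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# toy vocab; a real one would be loaded from the model's vocab file
vocab = {"<unk>": 0, "Hey": 1, "how": 2, "are": 3, "you": 4, "?": 5}
tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()  # split into words/punctuation before the lookup

print(tok.encode("Hey how are you?").ids)  # [1, 2, 3, 4, 5]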

I'm thrilled to help you get this merged 😉

ArthurZucker avatar Dec 04 '23 09:12 ArthurZucker

Hello, could you take another look at this PR? The last few commits have added support for batch inference, and I feel it's getting close to being merged.

BBuf avatar Dec 05 '23 01:12 BBuf

Sure I’ll review today! 🤗

ArthurZucker avatar Dec 06 '23 05:12 ArthurZucker

> Okay, understood! So this new model uses a word-level tokenizer, which can be supported both in transformers (by adding a new tokenizer with a simple vocab, i.e. the code you are proposing) and in tokenizers, which natively has a WordLevel model!

I would not call it "word level"; it's more of a "trie tokenizer". Spaces are simply another character with no special meaning, if that makes sense.
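
For illustration, a minimal sketch of greedy longest-match ("trie") tokenization, using a flat dict in place of a real trie (toy vocab of my own; the actual RWKV implementation walks a byte-level trie over the full vocab, but the matching rule is the same):

def greedy_tokenize(text, vocab, max_token_len):
    ids, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking one character at a time
        for j in range(min(len(text), i + max_token_len), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i]!r}")
    return ids

# the space is just another character, with no special meaning
vocab = {" ": 1, "h": 2, "o": 3, "w": 4, "how": 5, " how": 6}
print(greedy_tokenize(" how", vocab, max_token_len=4))  # [6] -- ' how' wins over ' ' + 'how'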

But yes, in concept this tokenizer could be used for non-RWKV architectures, and there is nothing stopping anyone from using our older GPT-NeoX tokenizer on our newer architecture.

Do let me know if I can clarify anything else from our end, or help in this merge =)

PicoCreator avatar Dec 08 '23 22:12 PicoCreator

Okay! I'll let you know, sorry I got caught up in sprints here and there but will review this early next week 🤗

ArthurZucker avatar Dec 09 '23 10:12 ArthurZucker

All the progress looks good! Ping me whenever for another review! 🤗

ArthurZucker avatar Jan 03 '24 07:01 ArthurZucker

Now that the RWKV5 pretrained model is out, will this get merged?

winglian avatar Jan 30 '24 01:01 winglian

I'll review again and help merge it asap!

ArthurZucker avatar Jan 30 '24 01:01 ArthurZucker

This yields the following:

>>> from transformers import Rwkv5Tokenizer
>>> tokenizer = Rwkv5Tokenizer("/Users/arthurzucker/Work/transformers/rwkv.txt")
>>> prompt = "Hey how are you? 男:听说你们公司要派你去南方工作"
>>> ids = tokenizer.encode(prompt)

>>> print(ids)
[0, 6037, 21887, 21338, 22851, 64, 65517, 14631, 19181, 11095, 16765, 10494, 10432, 10708, 11059, 16533, 13848, 10494, 11015, 10964, 13066, 12167, 10490]
>>> print(tokenizer.tokenize(prompt))
['Hey', ' how', ' are', ' you', '?', ' ', '男', ':', '听', '说', '你', '们', '公', '司', '要', '派', '你', '去', '南', '方', '工', '作']
>>> print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
['<s>', 'Hey', ' how', ' are', ' you', '?', ' ', '男', ':', '听', '说', '你', '们', '公', '司', '要', '派', '你', '去', '南', '方', '工', '作']
>>> print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))))
<s>Hey how are you? 男:听说你们公司要派你去南方工作
>>> print(tokenizer.decode(tokenizer.encode(prompt)))
<s>Hey how are you? 男:听说你们公司要派你去南方工作

ArthurZucker avatar Feb 05 '24 00:02 ArthurZucker

Thank you for your advice. The main problem here is that the original tokenizer implementation (https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer) does not have a bos, eos, or pad token, yet bos_token_id, eos_token_id, and pad_token_id are all set to 0. In my implementation on Hugging Face, I have simulated this situation. But now I am unsure what to set for the bos, eos, and pad tokens, as it seems that setting any token would not meet expectations, so this tokenizer feels like a special hack case. I would like to ask whether it would be acceptable not to merge the tokenizer's definition into the Transformers repository and to only merge the implementation of the RWKV5 model. This custom tokenizer implementation could then be placed in the corresponding repository on Hugging Face, for example: https://huggingface.co/RWKV/HF_v5-Eagle-7B.

BBuf avatar Feb 05 '24 01:02 BBuf

In the code I provided, I manually set self._added_tokens_decoder = {0: AddedToken(bos_token)}, which forces token 0 to be the bos_token. We can of course force any other behaviour that way, but we only need to define a string token. This will be very useful to the community as a whole for any SFT training, as it's the expected API.

ArthurZucker avatar Feb 05 '24 02:02 ArthurZucker

Whether or not the original tokenizer has a token there, it has token_id=0, which means we can choose the content of the token. I used <s>, but we should use something like <|endoftext|> just to make sure it doesn't already exist. This should solve all the issues you are having: doing something like tokenizer.encode("<|endoftext|>") will yield 0, which is what we want.
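
For reference, this is the behaviour an existing tokenizer with a defined special-token string gives (GPT-2 here, purely as an illustration; its <|endoftext|> happens to map to id 50256):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.encode("<|endoftext|>"))  # [50256] -- the special-token string round-trips to its id
print(tok.decode([50256]))          # <|endoftext|>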

ArthurZucker avatar Feb 05 '24 02:02 ArthurZucker

WDYT?

ArthurZucker avatar Feb 05 '24 02:02 ArthurZucker

In general, the model was not trained with a special bos_token or pad_token in mind (we use 0, and masking for padding).

So for all these tokens we typically just use token 0 as a fallback, and it "generally works", if that makes sense; defaulting all of them to 0 makes sense to me (coming from the trainer / model team side).
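
Concretely (toy arrays for illustration, not RWKV code), the convention described above looks like this:

import numpy as np

# two sequences padded to the same length, using token id 0 for padding
input_ids = np.array([[6037, 21887, 21338, 22851],
                      [6037, 21887,     0,     0]])
# the attention mask marks which positions are real (1) vs padding (0)
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])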

PicoCreator avatar Feb 05 '24 03:02 PicoCreator

I ran into a bug with this tokenizer, sourced from here. I'm not sure how much the two code bases have diverged at this point, but @BlinkDL asked me to report it in this PR.

My issue relates to this code:

import numpy as np
from tqdm import tqdm

# batches, block_size, stride, and tokenizer are defined earlier in my code
tokenized_batches = []
with tqdm(total=len(batches)) as pbar:
    for batch in batches:
        tokenized = tokenizer(
            batch,
            max_length=block_size,
            stride=stride,
            padding="max_length",
            return_overflowing_tokens=True,
            truncation=True,
            return_tensors="np",
        )
        tokenized_batches.append(tokenized["input_ids"])
        pbar.update(1)

tokens = np.concatenate(tokenized_batches)

Typically, the tokenizer should pad these batches to a consistent length, but that's not happening here:

  File "/usr/local/lib/python3.10/dist-packages/aigen/datasets.py", line 188, in encode_tokens
    tokens = np.concatenate(tokenized_batches)
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 24226 and the array at index 1 has size 26680

The fix for me was rather simple, but I feel like it should probably be handled by the tokenizer itself:

# Pad every tokenized batch out to the length of the longest one
max_len = max(batch.shape[1] for batch in tokenized_batches)
padded_batches = []
for batch in tokenized_batches:
    # Calculate the number of padding tokens needed
    padding_length = max_len - batch.shape[1]
    # Pad the batch along the sequence axis and collect it
    if padding_length > 0:
        padded_batch = np.pad(batch, ((0, 0), (0, padding_length)), mode='constant', constant_values=tokenizer.pad_token_id)
    else:
        padded_batch = batch
    padded_batches.append(padded_batch)

Just a PSA. Thanks!

Vectorrent avatar Feb 11 '24 16:02 Vectorrent

Can you try with this one: https://github.com/huggingface/transformers/pull/26963#pullrequestreview-1861671950

ArthurZucker avatar Feb 12 '24 06:02 ArthurZucker

> Can you try with this one: #26963 (review)

Thank you, I will give it a try.

BBuf avatar Feb 12 '24 08:02 BBuf


I tried this tokenizer, but it seems that I can't get the expected results.

[screenshot of the unexpected tokenizer output]

I would like to ask whether the rwkv.txt in the code you provided is the original file from https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_vocab_v20230424.txt.

BBuf avatar Feb 13 '24 13:02 BBuf

I converted the vocab to the appropriate format to read it; sorry, I forgot that step. Will push it now. I used your tokenizer's encoder.

ArthurZucker avatar Feb 14 '24 00:02 ArthurZucker

> I converted the vocab to the appropriate format to read it; sorry, I forgot that step. Will push it now. I used your tokenizer's encoder.

Okay, thanks.

BBuf avatar Feb 14 '24 06:02 BBuf

I used this:

from transformers import AutoTokenizer
# Rwkv5Tokenizer comes from this PR's branch
from transformers import Rwkv5Tokenizer

tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-5-world-1b5", trust_remote_code=True)

# dump the vocab, one token per line (note: .items(), not .item())
with open("/Users/arthurzucker/Work/transformers/rwkv.txt", "wb") as f:
    for index, token in tokenizer.encoder.items():
        f.write(token + b"\n")

tokenizer = Rwkv5Tokenizer("/Users/arthurzucker/Work/transformers/rwkv.txt")
prompt = "Hey how are you? 男:听说你们公司要派你去南方工作"
ids = tokenizer.encode(prompt)
print(ids)
print(tokenizer.tokenize(prompt))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))))
print(tokenizer.decode(tokenizer.encode(prompt)))

ArthurZucker avatar Feb 16 '24 09:02 ArthurZucker

Pushed the tokenizer here: https://huggingface.co/ArthurZ/rwkv-5

ArthurZucker avatar Feb 16 '24 09:02 ArthurZucker

> Pushed the tokenizer here: https://huggingface.co/ArthurZ/rwkv-5

Okay, got it.

BBuf avatar Feb 16 '24 13:02 BBuf

> Pushed the tokenizer here: https://huggingface.co/ArthurZ/rwkv-5

Hello, I encountered an error while testing this tokenizer, but I'm not sure how to resolve it.

ERROR: test_added_token_serializable (tests.models.rwkv5.test_tokenization_rwkv5.RWKV5TokenizationTest.test_added_token_serializable) [Rwkv5Tokenizer]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bbuf/工作目录/RWKV/transformers/tests/test_tokenization_common.py", line 2204, in test_added_token_serializable
    tokenizer.from_pretrained(tmp_dir_name)
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/tokenization_utils_base.py", line 2031, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/tokenization_utils_base.py", line 2263, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers-4.38.0.dev0-py3.11.egg/transformers/models/rwkv5/tokenization_rwkv5.py", line 133, in __init__
    self._added_tokens_decoder = {0:AddedToken(bos_token)}
                                    ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument 'content': 'AddedToken' object cannot be converted to 'PyString'

BBuf avatar Feb 16 '24 14:02 BBuf
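
A minimal reproduction of that TypeError, with a guard that would likely avoid it (this is one reading of the traceback, not necessarily the fix the PR ends up with): from_pretrained hands special tokens back as AddedToken instances, and tokenizers' AddedToken constructor only accepts a plain string as its content, so re-wrapping fails.

from tokenizers import AddedToken

bos_token = AddedToken("<|endoftext|>")  # what from_pretrained passes back for a saved special token
try:
    AddedToken(bos_token)  # re-wrapping an AddedToken raises the TypeError from the traceback
except TypeError as err:
    print(err)

# guard: only wrap when we actually have a plain string
bos = bos_token if isinstance(bos_token, AddedToken) else AddedToken(bos_token)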