Error when loading GPT2 tokenizer while specifying "unk_token"
System Info
- transformers version: 4.28.0.dev0
- Platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
For a certain reason, I need to modify the default unk_token of GPT2Tokenizer, which is currently "<|endoftext|>". When I tried to change it, I encountered the following problems.
```python
from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "unk_token": "<|unk|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.encode(["<|unk|>"])
```
where the directory ./tokenizer contains all tokenizer files provided by gpt2 (small): tokenizer.json, merges.txt, vocab.json.
error information:
```
Traceback (most recent call last):
  File "./model/unit_test_customed_gpt2.py", line 451, in test_BuildMappingFileTestCase_bpe_mhp_gpt
    self.tokenizer.build_mapping_file(self.mapped_tokenizer, "./tokenizer/customed-mhp-gpt-bpe/mapping_%s.json"%text, max_length=32, is_chinese_vocab=False)
  File "/home/X/scratch/variable-text-segmentation/data_utils/sp_tokenizer.py", line 500, in build_mapping_file
    mapping_ids = mapping_tokenizer.encode(mapped_text, add_special_tokens=False)
  File "/home/lsiyang/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2302, in encode
    encoded_inputs = self.encode_plus(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2710, in encode_plus
    return self._encode_plus(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 650, in _encode_plus
    return self.prepare_for_model(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3189, in prepare_for_model
    encoded_inputs = self.pad(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2979, in pad
    raise ValueError(
ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object.
```
I think I know the reason. When we specify a new token as unk_token via GPT2Tokenizer.from_pretrained(*, unk_token=XX), it does not first add this new token to the vocabulary; it only updates self.tokenizer.unk_token = XX. As a result, the tokenizer correctly reports its unk_token, but it cannot find the token id of that new unk_token in the vocab. The problem lies in tokenization_utils.py:
```python
def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    new_tokens = [str(tok) for tok in new_tokens]

    tokens_to_add = []
    for token in new_tokens:
        if not isinstance(token, str):
            raise TypeError(f"Token {token} is not a string but a {type(token)}.")
        if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
            token = token.lower()
        if (
            token != self.unk_token  # PROBLEM: self.unk_token has already been updated to the new value, so the new unk_token can never be added.
            and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
            and token not in tokens_to_add
        ):
            tokens_to_add.append(token)
            if self.verbose:
                logger.info(f"Adding {token} to the vocabulary")
```
For other special tokens, like sep_token, it is allowed to specify them via GPT2Tokenizer.from_pretrained(*, sep_token=XX): even if the token doesn't exist in the vocab yet, it is added as a new token.
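For comparison, a sketch of the sep_token case under the same assumption (hub checkpoint gpt2): here the token is added and gets a fresh id.

```python
from transformers import GPT2Tokenizer

# Sketch only: "gpt2" hub checkpoint assumed; the exact new id depends on the vocab size.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", sep_token="<|sep|>")

print(tokenizer.sep_token_id)  # a fresh id (50257 for the stock 50257-entry vocab)
print(len(tokenizer))          # the vocabulary grew by one
```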
In contrast, even adding the unk_token explicitly via add_special_tokens is impossible:
```python
from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.add_special_tokens({"unk_token": "<|unk|>"})
tokenizer.encode(["<|unk|>"])
```
I think we should also allow specifying an unk_token that does not yet exist in the vocabulary, just like the other special tokens.
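Until that is supported, a possible workaround sketch (not an official fix; hub checkpoint gpt2 assumed) is to add the string to the vocabulary first, while self.unk_token still points to "<|endoftext|>", and only then re-assign unk_token:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # unk_token is still "<|endoftext|>"
tokenizer.add_tokens(["<|unk|>"], special_tokens=True)  # "<|unk|>" now gets its own id
tokenizer.unk_token = "<|unk|>"                         # re-point unk_token afterwards

print(tokenizer.unk_token_id)  # an actual id instead of None
print(tokenizer.encode("<|unk|>", add_special_tokens=False))
```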
Expected behavior
I think we should also allow specifying an unk_token that does not yet exist in the vocabulary, just like the other special tokens.
Hey! Indeed this is a problem I stumbled on when integrating Whisper.
Two things are at play for me here:
1. We should support re-assignment of the unk token, so yes, a PR makes sense (and I think it makes sense for all tokenizers). The following output is not good:
```
In [9]: tokenizer.all_special_ids
Out[9]: [50256, None, 50257, 50258, 50259, 50260]
```
This is what we get when trying to add this token. So I am in for the fix.
2. As we can see in the traceback, when a token is OOV, we don't raise an error ourselves, which ends up being a bit hard to debug. We can't really change the default behaviour for GPT2 (it's too old), but we can raise the error ourselves! (I'll probably tackle this in another PR!)

Good catch! 🔥
(cc @Narsil fyi)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I still have this issue when using: tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
This is the error output from bark:
```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
tokenizer.all_special_ids : [None, 0, 1, 2, 3]
BertTokenizer(name_or_path='bert-base-multilingual-cased', vocab_size=0, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '<|unk|>', 'sep_token': '<|sep|>', 'pad_token': '<|pad|>', 'cls_token': '<|cls|>', 'mask_token': '<|mask|>'}, clean_up_tokenization_spaces=True)
Traceback (most recent call last):
  File "/home/gpc2/codes_ood/Codes/TTS/bark/text_to_speech_bark.py", line 11, in
```
Thanks for reporting. As you can see, the PR is still open and the bug has not been addressed yet! I'll take care of it; this is also related to the potential refactoring of how tokens are added.