Error when loading GPT2 tokenizer while specifying "unk_token"
System Info
- transformers version: 4.28.0.dev0
- Platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
For a certain reason, I need to modify the default unk_token of GPT2Tokenizer, which is currently "<|endoftext|>". When I tried to change it, I encountered the following problems.
```python
from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "unk_token": "<|unk|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.encode(["<|unk|>"])
```
where the directory ./tokenizer contains all tokenizer files provided by gpt2 (small): tokenizer.json, merges.txt, vocab.json.
error information:
```
Traceback (most recent call last):
  File "./model/unit_test_customed_gpt2.py", line 451, in test_BuildMappingFileTestCase_bpe_mhp_gpt
    self.tokenizer.build_mapping_file(self.mapped_tokenizer, "./tokenizer/customed-mhp-gpt-bpe/mapping_%s.json"%text, max_length=32, is_chinese_vocab=False)
  File "/home/X/scratch/variable-text-segmentation/data_utils/sp_tokenizer.py", line 500, in build_mapping_file
    mapping_ids = mapping_tokenizer.encode(mapped_text, add_special_tokens=False)
  File "/home/lsiyang/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2302, in encode
    encoded_inputs = self.encode_plus(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2710, in encode_plus
    return self._encode_plus(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 650, in _encode_plus
    return self.prepare_for_model(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3189, in prepare_for_model
    encoded_inputs = self.pad(
  File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2979, in pad
    raise ValueError(
ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object.
```
I think I know the reason. When we specify a new token as unk_token via GPT2Tokenizer.from_pretrained(*, unk_token=XX), it does not first add this new token to the vocabulary; it only updates self.tokenizer.unk_token = XX. As a result, the tokenizer correctly reports its unk_token, but it cannot find the token id of that new unk_token in the vocab. The problem lies in tokenization_utils.py:
```python
def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    new_tokens = [str(tok) for tok in new_tokens]

    tokens_to_add = []
    for token in new_tokens:
        if not isinstance(token, str):
            raise TypeError(f"Token {token} is not a string but a {type(token)}.")
        if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
            token = token.lower()
        if (
            token != self.unk_token  # PROBLEM: self.unk_token has already been updated to the new value, so the new unk_token can never be added.
            and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
            and token not in tokens_to_add
        ):
            tokens_to_add.append(token)
            if self.verbose:
                logger.info(f"Adding {token} to the vocabulary")
```
For other special tokens, like sep_token, it is allowed to specify them via GPT2Tokenizer.from_pretrained(*, sep_token=XX): even if the token doesn't exist in the vocab yet, it is added as a new token.
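For comparison, a sketch of the sep_token case under the same assumption (hub checkpoint gpt2): here the token is added and gets a fresh id.

```python
from transformers import GPT2Tokenizer

# Sketch only: "gpt2" hub checkpoint assumed; the exact new id depends on the vocab size.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", sep_token="<|sep|>")

print(tokenizer.sep_token_id)  # a fresh id (50257 for the stock 50257-entry vocab)
print(len(tokenizer))          # the vocabulary grew by one
```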
In contrast, even adding the unk_token explicitly via add_special_tokens is impossible:
```python
from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.add_special_tokens({"unk_token": "<|unk|>"})
tokenizer.encode(["<|unk|>"])
```
I think we should also allow specifying an unk_token that does not yet exist in the vocabulary, just like the other special tokens.
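Until that is supported, a possible workaround sketch (not an official fix; hub checkpoint gpt2 assumed) is to add the string to the vocabulary first, while self.unk_token still points to "<|endoftext|>", and only then re-assign unk_token:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # unk_token is still "<|endoftext|>"
tokenizer.add_tokens(["<|unk|>"], special_tokens=True)  # "<|unk|>" now gets its own id
tokenizer.unk_token = "<|unk|>"                         # re-point unk_token afterwards

print(tokenizer.unk_token_id)  # an actual id instead of None
print(tokenizer.encode("<|unk|>", add_special_tokens=False))
```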
Expected behavior
I think we should also allow specifying an unk_token that does not yet exist in the vocabulary, just like the other special tokens.
Hey! Indeed this is a problem I stumbled on when integrating Whisper.
Two things are at play for me here:
1. We should support re-assignment of the unk token, so yes, a PR makes sense (and I think it makes sense for all tokenizers). The following output is not good:
```
In [9]: tokenizer.all_special_ids
Out[9]: [50256, None, 50257, 50258, 50259, 50260]
```
This is what we get when trying to add this token. So I am in for the fix.
2. As we can see in the traceback, when a token is OOV, we don't raise an error ourselves, which ends up being a bit hard to debug. We can't really change the default behaviour for GPT2 (it's too old), but we can raise the error ourselves! (I'll probably tackle this in another PR!)

Good catch! 🔥
(cc @Narsil fyi)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I still have this issue when using: tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
This is the error output from bark:
```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
tokenizer.all_special_ids : [None, 0, 1, 2, 3]
BertTokenizer(name_or_path='bert-base-multilingual-cased', vocab_size=0, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '<|unk|>', 'sep_token': '<|sep|>', 'pad_token': '<|pad|>', 'cls_token': '<|cls|>', 'mask_token': '<|mask|>'}, clean_up_tokenization_spaces=True)
Traceback (most recent call last):
  File "/home/gpc2/codes_ood/Codes/TTS/bark/text_to_speech_bark.py", line 11, in
```
Thanks for reporting. As you can see, the PR is still open and the bug has not been addressed yet! I'll take care of it; this is also related to the potential refactoring of how tokens are added.