How to Suppress "Using bos_token, but it is not set yet..." in HuggingFace T5 Tokenizer
I'd like to turn off the output that HuggingFace generates when I use unique_no_split_tokens, so that the following code executes cleanly without all the "Using ..." messages.
In[2] tokenizer = T5Tokenizer.from_pretrained("t5-base")
In[3] tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1,100)]), return_tensors="pt").input_ids.size()
Out[3]: torch.Size([1, 100])
Using bos_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Using sep_token, but it is not set yet.
Anyone know how to do this?
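One thing that may silence these messages, assuming they are emitted through transformers' own logger, is to raise the library's logging verbosity threshold. They appear to come out at ERROR level, so set_verbosity_error() alone may not be enough; a minimal sketch:

import transformers
from transformers import T5Tokenizer

# Assumption: the "Using bos_token, but it is not set yet." messages go through
# the transformers logger at ERROR level, so raise the threshold above ERROR.
transformers.logging.set_verbosity(transformers.logging.CRITICAL)

tokenizer = T5Tokenizer.from_pretrained("t5-base")
print(tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1, 100)]), return_tensors="pt").input_ids.size())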
Hi @MRGLabs,
I can't seem to reproduce this. Which version of transformers are you using?
Btw, T5Tokenizer is the "slow" version (not this lib), T5TokenizerFast is the one that uses this lib (AutoTokenizer should load that one automatically).
Both should be exactly the same (except for speed), at least theoretically.
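For example, a quick way to confirm which class AutoTokenizer actually returns (a sketch; the exact class depends on which backends are installed):

from transformers import AutoTokenizer

# AutoTokenizer prefers the Rust-backed "fast" tokenizer when one is available.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
print(type(tokenizer).__name__)  # typically "T5TokenizerFast"
print(tokenizer.is_fast)         # True for tokenizers-backed (fast) tokenizers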
Hi @Narsil, I'm using transformers 4.16.2.
Thank you for the tip regarding T5TokenizerFast.
I was able to mitigate the issue by explicitly adding special tokens, like so:
# Re-register the sentinel tokens explicitly as special tokens
# (model is the corresponding T5 model instance)
tokenizer.add_tokens([f"<extra_id_{n}>" for n in range(1, 100)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
tokenizer.save_pretrained('pathToExtendedTokenizer/')
tokenizer = T5Tokenizer.from_pretrained("pathToExtendedTokenizer/")
Hi @Narsil, I'm using transformers 4.16.2.
I really can't seem to reproduce this from a fresh install. Isn't there something changing the log level, or something in your environment?
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
print(tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1, 100)]), return_tensors="pt").input_ids.size())
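A quick way to check whether something in the environment has changed the transformers log level (a sketch; TRANSFORMERS_VERBOSITY is the environment variable the library reads):

import transformers

# Default verbosity is WARNING (30); a lower value, or an override via the
# TRANSFORMERS_VERBOSITY environment variable, changes what gets printed.
print(transformers.logging.get_verbosity())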