How to Suppress "Using bos_token, but it is not set yet..." in HuggingFace T5 Tokenizer

Open xsys-technology opened this issue 3 years ago • 4 comments

I'd like to turn off the output that Hugging Face generates when I use unique_no_split_tokens, so that the following code runs cleanly without all the "Using ..." messages:

In[2] tokenizer = T5Tokenizer.from_pretrained("t5-base")
In[3] tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1,100)]), return_tensors="pt").input_ids.size()
Out[3]: torch.Size([1, 100])
Using bos_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
Using sep_token, but it is not set yet.

Anyone know how to do this?

xsys-technology avatar Feb 08 '22 19:02 xsys-technology

Hi @MRGLabs ,

I can't seem to reproduce this. Which version of transformers are you using?

Btw, T5Tokenizer is the "slow" version (not this lib); T5TokenizerFast is the one that uses this lib (AutoTokenizer should load that one automatically).

Both should be exactly the same (except the speed), at least theoretically.

Narsil avatar Feb 14 '22 12:02 Narsil

Hi @Narsil , I'm using transformers 4.16.2.

Thank you for the tip regarding T5TokenizerFast.

I was able to mitigate the issue by explicitly adding special tokens, like so:

tokenizer.add_tokens([f"_{n}" for n in range(1,100)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
tokenizer.save_pretrained('pathToExtendedTokenizer/')
tokenizer = T5Tokenizer.from_pretrained("pathToExtendedTokenizer/")

xsys-technology avatar Feb 14 '22 14:02 xsys-technology

> Hi @Narsil , I'm using transformers 4.16.2.

I really can't reproduce this from a fresh install. Is something in your environment changing the log level? This is the script I ran:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
print(tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1, 100)]), return_tensors="pt").input_ids.size())
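If the log level is the culprit, one way to test that theory is to raise the threshold on the logger that emits these lines. This is only a sketch: it assumes the "Using ..._token, but it is not set yet." messages go through Python's standard logging under the logger name transformers.tokenization_utils_base, which is not confirmed anywhere in this thread and may differ between transformers versions. The helper name quiet_tokenizer_logger is made up for illustration.

```python
import logging

# Assumption: the messages are emitted by the standard-library logger with
# this name. Adjust it if your transformers version logs from elsewhere.
TOKENIZER_LOGGER_NAME = "transformers.tokenization_utils_base"


def quiet_tokenizer_logger(level=logging.CRITICAL):
    """Raise the threshold of the tokenizer logger; return the previous level.

    Hypothetical helper, not part of transformers.
    """
    logger = logging.getLogger(TOKENIZER_LOGGER_NAME)
    previous = logger.level
    # Records below `level` (including ERROR when level is CRITICAL) are dropped.
    logger.setLevel(level)
    return previous


# Call this before creating the tokenizer.
previous_level = quiet_tokenizer_logger()
```

If the messages disappear after this call, the log level was indeed involved. The earlier threshold can be restored with logging.getLogger(TOKENIZER_LOGGER_NAME).setLevel(previous_level).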

Narsil avatar Feb 15 '22 16:02 Narsil

mark

KpKqwq avatar Aug 01 '22 15:08 KpKqwq

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 29 '24 01:02 github-actions[bot]