
Fixed NoneType attribute crash in tokenization_utils_base.py

Open ElleLeonne opened this issue 1 year ago • 4 comments

The attribute self.model_max_length is not set in all tokenizers.

In these cases, the tokenizer crashes the program without the listed change.

I noticed it specifically when loading tokenizers from disk, where the attribute appeared to get lost.
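To sketch the shape of the change I'm describing (illustrative only, not the actual diff; the checkpoint is arbitrary and the 1e30 fallback mirrors what I believe the library uses as its VERY_LARGE_INTEGER sentinel):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any checkpoint works for illustration

# Simulate the situation I'm describing: the attribute ends up as None.
tokenizer.model_max_length = None

# A comparison such as `len(ids) > tokenizer.model_max_length` then raises
# TypeError: '>' not supported between instances of 'int' and 'NoneType'.
# The guard I have in mind is essentially:
max_length = tokenizer.model_max_length
if max_length is None:
    max_length = int(1e30)  # fall back to a huge sentinel instead of crashing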

ElleLeonne · May 08 '24 19:05

cc @ArthurZucker

amyeroberts · May 09 '24 10:05

Thanks! Do you have a small reproducer?

ArthurZucker · May 09 '24 10:05

Thanks! Do you have a small reproducer?

Oh dear, thank you for asking. It appears I made a small mistake and jumped to conclusions early.

Allow me to properly demonstrate the problem. As part of my code, I manually override this attribute on the tokenizer.

However, the tokenizer appears to lose track of that change afterwards: the attribute ends up as None in the snippet below, which causes a crash.

The snippet reproduces the problem, and the proposed change makes the code run and produce the expected output; however, there may be a deeper root cause for why the attribute is None here.

from transformers import GemmaTokenizer
import random

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-7b")
tokenizer.model_max_length = 10

# Generate some hard-to-tokenize nonsense.
string = ' '.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=20))

text = tokenizer.encode(string)

ElleLeonne · May 09 '24 13:05

cc @itazap if you can have a look!

ArthurZucker · May 10 '24 08:05

Hi @ElleLeonne!

I am unable to reproduce the error with the code snippet provided. I only observe the following warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (21 > 10). Running this sequence through the model will result in indexing errors

After this warning, the code continues to run successfully and text contains the full encoded sequence (with length 21). So even though the sequence is too long, the expected behaviour is still to encode it in full. If your use case requires it to be truncated to model_max_length, you can use tokenizer.encode(string, truncation=True).
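For example, continuing from your snippet above (the lengths shown are what I would expect with model_max_length set to 10):

# `tokenizer` and `string` as defined in the snippet above.
encoded = tokenizer.encode(string)                      # warns, returns all 21 token ids
truncated = tokenizer.encode(string, truncation=True)   # truncated to model_max_length (10 ids)

print(len(encoded), len(truncated))  # expected: 21 10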

If you are experiencing an error / crash, can you please provide a stacktrace?

Thanks!

itazap · May 14 '24 10:05

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jun 08 '24 08:06

I figured it out.

The tl;dr is that, rather than setting the attribute to None in your code, you should set it to 1e20.

In my code (for some reason), that value was getting set to None after initialization, and even though I could read the attribute and print its value, there was still a stale reference in the back end that pointed to None.

I now initialize the variable to 1e20, as in your source code, and rather than handling null conditions, I use the attribute's original value for the default behaviour.

Problem solved: the attribute can never become None because it never was None.
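Concretely, my workaround looks roughly like this (a sketch of my own setup with an arbitrary checkpoint, not library code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # arbitrary checkpoint for illustration

# Keep the original limit around and never assign None to the attribute.
original_max_length = tokenizer.model_max_length

# When I want "no limit", use a huge sentinel instead of None,
# so integer comparisons against the attribute keep working.
tokenizer.model_max_length = int(1e20)

# When I want the real limit back, restore the saved value
# instead of testing for None anywhere.
tokenizer.model_max_length = original_max_length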

Weird bug though. Not sure what about my environment causes the deviant behavior.

ElleLeonne · Jun 09 '24 14:06