`truncation='do_not_truncate'` is not working equivalently to `truncation=False`
Hi,
`truncation='do_not_truncate'` is not working equivalently to `truncation=False`.
When using `truncation=False` and providing `max_length`, the tokenizer defaults to the `'longest_first'` truncation strategy.
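For illustration, here is a minimal check of that claim, using the same tokenizer and sentence as in the example at the end of this report; if `False` really falls back to `'longest_first'`, both calls should return the same truncated length:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'

# If truncation=False silently falls back to 'longest_first',
# these two calls should produce the same 5-token output:
print(len(tokenizer.encode(sent, max_length=5, truncation=False)))
print(len(tokenizer.encode(sent, max_length=5, truncation='longest_first')))
```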
Whether this default behavior is natural or not, isn't `False` supposed to be identical to `'do_not_truncate'`?
This leads to a situation where the user explicitly specifies `truncation=False` but the text is truncated anyway.
This manual: https://huggingface.co/docs/transformers/pad_truncation and this doc: https://huggingface.co/docs/transformers/main_classes/tokenizer say that:

> `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.

Which means that they are supposed to be equivalent (regardless of what they do, they should behave the same).
I suggest that `False` should just mean "no truncation", regardless of whether `max_length` was supplied or not (a possible user-side workaround in that spirit is sketched after the example below).
Here is a short example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'

print(len(tokenizer.encode(sent, max_length=5, truncation='do_not_truncate')))
# prints: 11
print(len(tokenizer.encode(sent, max_length=5, truncation=False)))
# prints: 5
```
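Until the semantics are aligned, a minimal user-side sketch of the suggested behavior could normalize `truncation=False` to `'do_not_truncate'` before delegating to the tokenizer. The `encode_never_truncating` helper below is hypothetical, not part of the library:

```python
from transformers import AutoTokenizer

def encode_never_truncating(tokenizer, text, **kwargs):
    """Hypothetical helper: make truncation=False really mean 'no truncation'."""
    if kwargs.get('truncation') is False:
        # 'do_not_truncate' is honored even when max_length is supplied,
        # so rewrite the flag before calling the real encode().
        kwargs['truncation'] = 'do_not_truncate'
    return tokenizer.encode(text, **kwargs)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'
print(len(encode_never_truncating(tokenizer, sent, max_length=5, truncation=False)))
# prints: 11 -- no truncation, matching the documented behavior
```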
Thanks, Uri