
`truncation='do_not_truncate'` is not working equivalently to `truncation=False`

Open urialon opened this issue 2 years ago • 0 comments

Hi, `truncation='do_not_truncate'` is not working equivalently to `truncation=False`. When using `truncation=False` and providing `max_length`, the tokenizer defaults to the `'longest_first'` truncation strategy. Whether or not this default behavior is natural, isn't `False` supposed to be identical to `'do_not_truncate'`?

This leads to a situation where the user explicitly specifies `truncation=False` but the text is truncated anyway.

Both this manual (https://huggingface.co/docs/transformers/pad_truncation) and this doc (https://huggingface.co/docs/transformers/main_classes/tokenizer) say:

> `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.

This means they are supposed to be equivalent: regardless of what they do, they should behave the same.

I suggest that `False` should simply mean "no truncation", regardless of whether `max_length` was supplied.
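For illustration, here is a minimal sketch of the mapping the docs describe, where `False` and `'do_not_truncate'` resolve to the same strategy no matter what `max_length` is (the function name and structure are my own, not the library's internals):

```python
def resolve_truncation_strategy(truncation, max_length=None):
    """Illustrative mapping from a user-supplied `truncation` value to a
    strategy name, following the documented equivalences.  Note that
    `max_length` plays no role in the False / 'do_not_truncate' branch."""
    if truncation is False or truncation == "do_not_truncate":
        return "do_not_truncate"  # docs: no truncation, the default behavior
    if truncation is True or truncation == "longest_first":
        return "longest_first"
    if truncation in ("only_first", "only_second"):
        return truncation
    raise ValueError(f"unknown truncation value: {truncation!r}")

# Under this mapping, both spellings behave identically even with max_length:
print(resolve_truncation_strategy(False, max_length=5))              # do_not_truncate
print(resolve_truncation_strategy("do_not_truncate", max_length=5))  # do_not_truncate
```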

Here is a short example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
sent = 'The quick brown fox jumps over the lazy dog'

len(tokenizer.encode(sent, max_length=5, truncation='do_not_truncate'))
# prints: 11

len(tokenizer.encode(sent, max_length=5, truncation=False))
# prints: 5
```
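Until the two spellings behave the same, a caller can avoid the surprise by always passing the string form explicitly. A minimal wrapper sketch (the function name is mine, not part of the library; it works with any object exposing a `tokenizer.encode`-style method):

```python
def encode_without_truncation(tokenizer, text, max_length=None):
    """Encode `text` while guaranteeing nothing is cut off, by passing the
    string form 'do_not_truncate', which (unlike False, per the example
    above) really disables truncation even when max_length is given."""
    return tokenizer.encode(text, max_length=max_length,
                            truncation="do_not_truncate")
```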

Thanks, Uri

urialon · Sep 21 '22 19:09