
Difference in behavior between fast tokenizers and normal tokenizers regarding unicode characters in strings

Open avi-jain opened this issue 2 years ago • 4 comments

Hello, we recently switched from the Python-based tokenizers to the newer fast tokenizers and noticed some of our code breaking when the inputs contain Unicode characters (emojis, etc.), which wasn't an issue earlier. To reproduce:

For normal tokenizers

```python
test_string = ['bath', '&', 'bloom', 'mango', 'tangerine', 'shampoo', '250', 'ml', '\ud83c\udf37']

tokenizer_slow(test_string, add_special_tokens=True, is_split_into_words=True, truncation=True)
```

Output is:

```python
{'input_ids': [101, 7198, 1004, 13426, 24792, 9745, 24226, 25850, 24667, 5539, 19875, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

For fast tokenizers

```python
test_string = ['bath', '&', 'bloom', 'mango', 'tangerine', 'shampoo', '250', 'ml', '\ud83c\udf37']

tokenizer(test_string, add_special_tokens=True, is_split_into_words=True, truncation=True)
```

This raises:

```
TypeError: PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]
```

Additional stack trace:

```
~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2473             )
   2474         else:
-> 2475             return self.encode_plus(
   2476                 text=text,
   2477                 text_pair=text_pair,

~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2546         )
   2547 
-> 2548         return self._encode_plus(
   2549             text=text,
   2550             text_pair=text_pair,

~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    496 
    497         batched_input = [(text, text_pair)] if text_pair else [text]
--> 498         batched_output = self._batch_encode_plus(
    499             batched_input,
    500             is_split_into_words=is_split_into_words,

~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    423         )
    424 
--> 425         encodings = self._tokenizer.encode_batch(
    426             batch_text_or_text_pairs,
    427             add_special_tokens=add_special_tokens,
```

avi-jain avatar Jun 15 '22 17:06 avi-jain

Hi @avi-jain,

Do you mind sharing which tokenizer you are using?

It's hard to help without knowing what kind of tokenizer it is and whether the parameters are set correctly.

P.S.: FYI, '\ud83c\udf37' is not valid UTF-8, which is necessary for tokenizers to work correctly.

Narsil avatar Jul 04 '22 14:07 Narsil

Oh sure, `AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)`
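
For reference, the full setup behind the snippets above looks roughly like this (a sketch; `tokenizer_slow` is assumed to be the `use_fast=False` variant of the same checkpoint):

```python
from transformers import AutoTokenizer

# Slow, pure-Python tokenizer (assumed pairing for the reproduction above)
tokenizer_slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
# Fast, Rust-backed tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```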

avi-jain avatar Jul 13 '22 01:07 avi-jain

Ok.

'\ud83c\udf37' is NOT valid UTF-8. BERT normalizes strings before processing them and hence discards those characters:

```python
In [11]: tokenizer('\ud83c\udf37')
Out[11]: {'input_ids': [101, 102], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}  # [CLS], [SEP]: no trace of the input.
```
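
You can check the string itself in plain Python (a quick sketch, no transformers needed): it is a pair of UTF-16 surrogate code points, not something encodable as UTF-8.

```python
s = '\ud83c\udf37'

# Lone surrogates cannot be encoded as UTF-8:
try:
    s.encode('utf-8')
except UnicodeEncodeError as e:
    print(e)  # 'utf-8' codec can't encode characters ...: surrogates not allowed

# What was probably intended is the single code point U+1F337 (the tulip emoji),
# which can be recovered by re-pairing the surrogates:
print(s.encode('utf-16', 'surrogatepass').decode('utf-16'))  # 🌷
```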

When you use is_split_into_words you bypass the tokenizer's normalization and pre-processing ENTIRELY (meaning you're extremely likely to trigger bugs and end up with actually incorrect ids). Since the string is invalid UTF-8 and the tokenizer expects valid UTF-8 (which would otherwise have been normalized), it crashes as expected.
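
To see what that normalization does, you can call the fast tokenizer's backend normalizer directly (a small sketch; the exact output depends on the checkpoint's normalizer settings):

```python
# bert-base-uncased uses a BertNormalizer (text cleanup, lowercasing, accent stripping, ...)
norm = tokenizer.backend_tokenizer.normalizer
print(norm.normalize_str("Héllo Wörld"))  # expected: "hello world"
```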

In my personal opinion, I would suggest dropping is_split_into_words and doing this directly:

```python
In [16]: tokenizer('bath & bloom mango tangerine shampoo 250 ml \ud83c\udf37')
Out[16]: {'input_ids': [101, 7198, 1004, 13426, 24792, 9745, 24226, 25850, 24667, 5539, 19875, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

This will always work and really represents what the model is supposed to see. I know there are some use cases for is_split_into_words, but they are usually extremely specific and come with a lot of caveats; the exception you see is one of them. Using offsets is the only way I know of to treat two different tokenizations in a coherent manner, which makes it possible to use the same code with different tokenizers (or to compare against pre-tokenized datasets, for instance).
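
For example, a small sketch of the offsets route (`return_offsets_mapping` is only available on fast tokenizers):

```python
text = 'bath & bloom mango tangerine shampoo 250 ml'
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=True)

# Each token maps back to a (start, end) character span in the original string;
# special tokens get (0, 0).
for tok_id, (start, end) in zip(enc['input_ids'], enc['offset_mapping']):
    print(tok_id, repr(text[start:end]))
```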

I hope that helps. Cheers.

Narsil avatar Jul 15 '22 11:07 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 15 '24 01:02 github-actions[bot]