Difference in behavior between fast tokenizers and normal tokenizers regarding Unicode characters in strings
Hello, we recently switched from the Python-based (slow) tokenizers to the newer fast tokenizers and noticed some of our code breaking when the inputs contain Unicode characters (emojis, etc.), which wasn't an issue earlier. To reproduce:
For the normal (slow) tokenizer:

```python
test_string = ['bath', '&', 'bloom', 'mango', 'tangerine', 'shampoo', '250', 'ml', '\ud83c\udf37']
tokenizer_slow(test_string, add_special_tokens=True, is_split_into_words=True, truncation=True)
```

The output is:

```
{'input_ids': [101, 7198, 1004, 13426, 24792, 9745, 24226, 25850, 24667, 5539, 19875, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
For the fast tokenizer:

```python
test_string = ['bath', '&', 'bloom', 'mango', 'tangerine', 'shampoo', '250', 'ml', '\ud83c\udf37']
tokenizer(test_string, add_special_tokens=True, is_split_into_words=True, truncation=True)
```

The output is:

```
TypeError: PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]
```
Additional stack trace:

```
~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2473 )
2474 else:
-> 2475 return self.encode_plus(
2476 text=text,
2477 text_pair=text_pair,
~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2546 )
2547
-> 2548 return self._encode_plus(
2549 text=text,
2550 text_pair=text_pair,
~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
496
497 batched_input = [(text, text_pair)] if text_pair else [text]
--> 498 batched_output = self._batch_encode_plus(
499 batched_input,
500 is_split_into_words=is_split_into_words,
~/Library/Python/3.8/lib/python/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
423 )
424
--> 425 encodings = self._tokenizer.encode_batch(
426 batch_text_or_text_pairs,
427 add_special_tokens=add_special_tokens,
```
Hi @avi-jain,
Do you mind sharing which tokenizer you are using?
It's hard to help you without knowing what kind of tokenizer you are using and whether the parameters are correctly set.
P.S.: FYI, `'\ud83c\udf37'` does not seem to be valid UTF-8, which is necessary for tokenizers to work correctly.
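For illustration, here is a quick check in plain Python (no transformers involved; the `surrogatepass` repair at the end is just one possible fix and assumes the surrogates really are a mis-decoded emoji):

```python
s = '\ud83c\udf37'  # two lone UTF-16 surrogates, not a real character

try:
    s.encode('utf-8')
except UnicodeEncodeError as err:
    print(err)  # 'utf-8' codec can't encode character '\ud83c' ...: surrogates not allowed

# One way to repair it, if the surrogates really are a broken UTF-16 pair:
fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')
print(fixed)                  # 🌷 (U+1F337)
print(fixed.encode('utf-8'))  # b'\xf0\x9f\x8c\xb7' -- valid UTF-8
```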
Oh sure:

```python
AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```
OK, `'\ud83c\udf37'` is NOT valid UTF-8.
BERT normalizes strings before processing them and hence discards those characters:
```python
In [11]: tokenizer('\ud83c\udf37')
Out[11]: {'input_ids': [101, 102], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}  # [CLS] [SEP], no trace of the input.
```
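If it helps to see why the character vanishes: each half of the unpaired emoji is a lone surrogate with Unicode category `Cs`, and as far as I remember the BERT cleanup step treats the `C*` (control-like) categories as characters to drop. A quick check in plain Python:

```python
import unicodedata

# Each half of the unpaired emoji is a lone surrogate -> category 'Cs'.
for ch in '\ud83c\udf37':
    print(hex(ord(ch)), unicodedata.category(ch))
# 0xd83c Cs
# 0xdf37 Cs

# The properly combined emoji U+1F337 is category 'So' (Symbol, other).
print(unicodedata.category('\U0001F337'))  # So
```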
When you're using `is_split_into_words`, you are ENTIRELY bypassing the normalization and preprocessing of the tokenizer (meaning you're extremely likely to trigger bugs and end up with incorrect ids).
Since the string is invalid UTF-8 and the tokenizer expects valid UTF-8 (which would otherwise have been normalized away), it crashes, as expected.
In my personal opinion, I would suggest dropping `is_split_into_words` and calling the tokenizer on the raw string directly:

```python
In [16]: tokenizer('bath & bloom mango tangerine shampoo 250 ml \ud83c\udf37')
Out[16]: {'input_ids': [101, 7198, 1004, 13426, 24792, 9745, 24226, 25850, 24667, 5539, 19875, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
This will always work and really represents what the model is supposed to see. I know there are some use cases for `is_split_into_words`, but they are usually extremely specific and come with a lot of caveats; the exception you see is one of them.
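If you really need to keep the pre-split input, one possible workaround (just a sketch on my side, not an official API, and note that dropping empty words shifts your word indices) is to sanitize each word to valid UTF-8 before calling the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

test_string = ['bath', '&', 'bloom', 'mango', 'tangerine', 'shampoo', '250', 'ml', '\ud83c\udf37']

# Drop anything that cannot be encoded as UTF-8 (e.g. lone surrogates), keep the rest.
clean = [w.encode('utf-8', errors='ignore').decode('utf-8') for w in test_string]

# Remove words that became empty after sanitizing (this shifts word indices!).
clean = [w for w in clean if w]

enc = tokenizer(clean, add_special_tokens=True, is_split_into_words=True, truncation=True)
print(enc['input_ids'])
```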
Using offsets is the only way I know of to treat two different tokenizations in a coherent manner, which would allow using the same code with different tokenizers (or comparing against pre-tokenized datasets, for instance).
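For reference, a small sketch of what I mean by offsets (assuming the same `bert-base-uncased` fast tokenizer as above): `offset_mapping` gives character spans into the original string, which lets you line the produced tokens up with your own word boundaries:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

text = 'bath & bloom mango tangerine shampoo 250 ml'
enc = tokenizer(text, return_offsets_mapping=True)

for token, (start, end) in zip(enc.tokens(), enc['offset_mapping']):
    # Special tokens get the empty span (0, 0); real tokens map back into `text`.
    print(token, (start, end), repr(text[start:end]))
```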
I hope that helps. Cheers.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.