Consolidate the EMOJI_PATTERN Unicode blocks and expand to include dingbats and more
Consolidate the Unicode blocks and expand to include dingbats and Japanese outdoor signage emojis.
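For reference, here is a minimal sketch of what a consolidated pattern could look like. The exact block list is an assumption on my part, not the ranges from the diff: in particular, I am guessing that the Japanese signage symbols live in the Enclosed Ideographic Supplement block, and `pad_emojis` is a hypothetical helper name used only for illustration.

```python
import re

# Hypothetical consolidated pattern; the ranges in the actual change may differ.
EMOJI_PATTERN = re.compile(
    "(["
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "\U0001F200-\U0001F2FF"  # Enclosed Ideographic Supplement (assumed home of the Japanese signage symbols)
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "])"
)


def pad_emojis(text: str) -> str:
    """Surround each matched symbol with spaces, then collapse repeated whitespace."""
    padded = EMOJI_PATTERN.sub(r" \1 ", text)
    return " ".join(padded.split())


print(pad_emojis("hi😀😀there"))
```

Keeping every block in a single character class means each range is listed once, with a comment naming the block, which should make later additions easier to review.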
The output goes into what seems to be BERT-base-uncased (https://huggingface.co/bert-base-uncased), a pretrained model that takes Unicode text as input. As far as I can tell, it can handle any Unicode character, including emojis, as long as the code point is within the range U+0000 to U+FFFF. If such characters were not part of the training data, they will just add clutter and slow down inference. I presume the model has been extended with emojis, which would explain why this code simply adds spaces between them, but it's worth checking whether anything above U+FFFF actually has any effect.
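A quick way to see which characters fall above U+FFFF is to inspect their code points. This is only a sketch in plain Python; `above_bmp` is a hypothetical helper, and it says nothing about what the model's vocabulary actually contains.

```python
def above_bmp(text: str) -> list[str]:
    """Return the characters in text whose code point is greater than U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "✂ 😀 é"
for ch in sample:
    if not ch.isspace():
        # Print each code point and whether it lies above U+FFFF.
        print(f"U+{ord(ch):04X} above U+FFFF: {ord(ch) > 0xFFFF}")
```

Dingbats such as ✂ sit below U+FFFF, while most emoji blocks (for example Emoticons, starting at U+1F600) sit above it, so the check matters for most of this pattern.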
Also consider using UnicodeCharacterTokenizer, or tf.strings.unicode_encode and tf.strings.unicode_decode, and consider training on emojis and accented characters. (Not sure where to log this.)
CLA Assistant is rate limited, but I hereby accept it.