Consolidate the EMOJI_PATTERN Unicode blocks and expand to include dingbats and more

Open dagelf opened this issue 2 years ago • 1 comments

Consolidate the Unicode blocks and expand to include dingbats and Japanese outdoor signage emojis.
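As a rough illustration of what consolidating means here, the sketch below merges adjacent Unicode blocks into single ranges and includes Dingbats. The ranges and the `space_out_emojis` helper are assumptions for illustration only. They are not the project's actual `EMOJI_PATTERN` and they do not cover every emoji block.

```python
import re

# Hypothetical consolidated pattern (illustrative ranges, not the project's own).
# Adjacent blocks are merged into one range instead of being listed separately.
EMOJI_PATTERN = re.compile(
    "["
    "\u2600-\u27BF"          # Miscellaneous Symbols (2600-26FF) + Dingbats (2700-27BF)
    "\U0001F300-\U0001F6FF"  # Misc Symbols and Pictographs through Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "]"
)


def space_out_emojis(text: str) -> str:
    """Put spaces around every matched character.

    Note: this also collapses any other run of whitespace to a single space.
    """
    spaced = EMOJI_PATTERN.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(space_out_emojis("hi\U0001F600\U0001F600there"))
```

Merging contiguous blocks keeps the character class short, at the cost of also matching any non-emoji symbols that fall inside the merged ranges.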

This goes into what seems to be BERT-base-uncased (https://huggingface.co/bert-base-uncased), a pretrained model that takes Unicode text as input. As far as I can tell, it can handle any Unicode character, such as emojis, as long as it is within the range U+0000 to U+FFFF. If such characters were not part of the training data, they will just clutter the input and slow down inference. I presume the model has been expanded with emojis, and that this is why the pattern simply adds spaces between them. It is still worth checking whether anything above U+FFFF actually has any effect.
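A first step toward that check is to see which characters in a sample actually lie above U+FFFF. The helper below is a minimal sketch using only the standard library. The name `non_bmp_chars` is hypothetical, and it does not test how the model or its tokenizer treats those characters.

```python
from typing import List


def non_bmp_chars(text: str) -> List[str]:
    """Return the characters in `text` whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    sample = "ok \u2702 \U0001F600"  # scissors (a dingbat) and a grinning face
    for ch in sample:
        where = "above U+FFFF" if ord(ch) > 0xFFFF else "within U+0000..U+FFFF"
        print(f"U+{ord(ch):04X} {where}")
```

Most emoji sit above U+FFFF, while Dingbats such as U+2702 sit below it, so a sample should include both kinds.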

Also consider using UnicodeCharacterTokenizer, or tf.strings.unicode_encode and tf.strings.unicode_decode, and consider training on emojis and accented characters. (Not sure where to log this.)

CLA Assistant is rate limited, but I hereby accept it.

Apr 01 '23 12:04 dagelf

All committers have signed the CLA.

Apr 01 '23 14:04 CLAassistant