preprocessor
Fix support for non-English texts
The encode('ascii', 'ignore').decode('ascii') strategy does not work for non-English text: it drops every non-ASCII character, not just emojis. Since emoji regex patterns already exist in defines.py, a regex substitution is sufficient to remove the emojis.
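As a rough illustration of the difference, here is a minimal sketch comparing the two approaches. EMOJI_PATTERN is a hypothetical stand-in covering only a few Unicode blocks; it is not the actual pattern from defines.py, and the function names are made up for this example.

```python
import re

# Hypothetical stand-in for the emoji pattern in defines.py.
# It covers only a few common blocks and is NOT the library's real pattern.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "]+"
)


def strip_non_ascii(text):
    """Old approach: silently drops every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")


def strip_emojis(text):
    """Regex approach: removes only characters matched by the emoji pattern."""
    return EMOJI_PATTERN.sub("", text)


sample = "Привет 😀 world"
print(repr(strip_non_ascii(sample)))  # the Cyrillic word is lost along with the emoji
print(repr(strip_emojis(sample)))     # the Cyrillic word survives, only the emoji is removed
```

Any emoji outside the listed ranges would pass through this sketch untouched, which is the same coverage limitation that applies to a fixed pattern in general.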
Fixes #47 and #48
However, the pattern defined in defines.py does not cover newer emojis and needs to be updated. Alternatively, emoji.get_emoji_regexp() from https://pypi.org/project/emoji can be used instead.