Encoding issue with non-English text
Passing a non-English Unicode string to preprocessor.clean with the preprocessor.OPT.EMOJI option enabled returns meaningless characters. This happens only on version 0.6.0.
The cause seems to be line 50 of preprocess.py.
To reproduce:

```python
import preprocessor as p

p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)
print(p.clean("внесла предложение призвать всех избегать применять незаконные"))
```
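For what it's worth, the damage can be reproduced without the library at all. A minimal sketch, assuming the line-50 hypothesis above amounts to an ASCII round-trip (my reading, not a quote from preprocess.py):

```python
# Suspected root cause in isolation: encoding to ASCII with
# errors='ignore' silently drops every non-ASCII character.
text = "внесла предложение призвать всех избегать применять незаконные"
print(repr(text.encode('ascii', 'ignore').decode('ascii')))
# -> '      '  (only the spaces survive; every Cyrillic letter is dropped)
```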
I have found similar behavior with Spanish. The problem is that characters with diacritics are removed, but no error is raised.
For example, the text (note the second letter from the left):

```
Sí, efectivamente, el Servicio de Vigilancia
```

is transformed into:

```
S, efectivamente, el Servicio de Vigilancia
```
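The same ASCII round-trip reproduces this Spanish case too; a quick check, assuming that mechanism:

```python
s = "Sí, efectivamente, el Servicio de Vigilancia"
# The accented í is not ASCII, so errors='ignore' drops it.
print(s.encode('ascii', 'ignore').decode('ascii'))
# -> S, efectivamente, el Servicio de Vigilancia
```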
In my case, this is the code I used:

```python
df['tweet'] = df['tweet'].apply(lambda x: p.clean(x))
```
The dataframe is read from a CSV file encoded as UTF-8 (without BOM).
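Until this is fixed, a workaround sketch that avoids the lossy path entirely by leaving OPT.EMOJI off (assumes the same df as above):

```python
import preprocessor as p

# Clean URLs and smileys only; skipping OPT.EMOJI avoids the
# ASCII round-trip that destroys non-English characters.
p.set_options(p.OPT.URL, p.OPT.SMILEY)
df['tweet'] = df['tweet'].apply(p.clean)
```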
I think the offending code that destroys everything that is not an ASCII character is:

```python
def preprocess_emojis(self, tweet_string, repl):
    processed = Patterns.EMOJIS_PATTERN.sub(repl, tweet_string)
    return processed.encode('ascii', 'ignore').decode('ascii')
```

There should be a better way to clean emojis, and there is: https://github.com/carpedm20/emoji. Maybe that library should be in charge of de-emojifying, although it stubbornly adds aliases like `:flexed_biceps:`.
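For the record, a minimal sketch of what the method could do instead. It assumes the emoji package's replace_emoji helper (available in recent releases), which substitutes emojis without touching other Unicode; the alias behavior mentioned above comes from demojize, which this avoids:

```python
import emoji

def preprocess_emojis(self, tweet_string, repl):
    # Replace emojis with repl and leave everything else untouched:
    # no ASCII round-trip, so Cyrillic, Arabic, Hindi, Korean, and
    # accented Latin characters all survive.
    return emoji.replace_emoji(tweet_string, replace=repl)
```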
The same issue appears with Arabic text on Windows 10: if the emoji option is on, it deletes all the characters.
Can confirm the same for Hindi and code-mixed Hindi-English as well.
Also removes all Korean text!