clean up HTML entities
Hello,
Thanks for this convenient library! 😄
Wouldn't it be desirable to add a regex that also cleans up HTML special entities such as "&amp;", "&gt;", etc. (full list here), which are often present in tweets?
The regex is quite straightforward:
HTML_ENTITIES_PATTERN = re.compile(r'&[a-zA-Z]+;')
(plus some small changes needed in defines.py).
I could open a PR, but I don't know how this would integrate with the other classes in the file (apart from Patterns).
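
For illustration, here is a minimal standalone sketch of what the proposed pattern would do, outside the library. The helper name `strip_html_entities` and the sample tweet are made up for this example and are not part of the library. It also shows the standard library's `html.unescape` as a possible alternative, in case decoding the entities is preferable to removing them.

```python
import html
import re

# Pattern proposed above: matches named entities such as &amp; or &gt;
HTML_ENTITIES_PATTERN = re.compile(r'&[a-zA-Z]+;')


def strip_html_entities(text: str) -> str:
    """Hypothetical helper: remove named HTML entities from the text."""
    return HTML_ENTITIES_PATTERN.sub('', text)


# Made-up sample tweet containing named and numeric entities
tweet = "Tom &amp; Jerry &gt; everything &#39;else&#39;"

# Removing: named entities disappear, numeric ones (&#39;) are not matched
print(strip_html_entities(tweet))

# Alternative: decode entities back to their characters instead
print(html.unescape(tweet))
```

Note that the pattern only covers named entities; numeric ones such as `&#39;` would need an extended pattern or `html.unescape`.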