preprocessor clean up HTML entities

Open etiennekintzler opened this issue 4 years ago • 0 comments

Hello,

Thanks for this convenient library! 😄

Wouldn't it be desirable to add a regex that also cleans up HTML special entities such as "&amp;", "&gt;", etc. (full list here), which are often present in tweets?

The regex is quite straightforward:

HTML_ENTITIES_PATTERN = re.compile(r'&[a-zA-Z]+;')

(plus the small changes necessary in defines.py).
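For illustration, here is a minimal sketch of what the pattern would do to a tweet. The helper name `strip_html_entities` and the sample text are made up for this example and are not part of the library. Note that the pattern only covers alphabetic named entities, so numeric references such as `&#39;` pass through untouched; the standard library's `html.unescape` is shown alongside as an alternative that decodes entities instead of deleting them.

```python
import html
import re

# Pattern proposed in this issue: matches alphabetic named entities (&amp;, &gt;, ...).
HTML_ENTITIES_PATTERN = re.compile(r'&[a-zA-Z]+;')


def strip_html_entities(text: str) -> str:
    """Hypothetical helper (not part of the library): delete named entities."""
    return HTML_ENTITIES_PATTERN.sub('', text)


# Made-up sample tweet containing named and numeric entities.
tweet = "Tom &amp; Jerry &gt; everything &#39;else&#39;"

# Named entities are removed; the numeric &#39; references are left as they are.
print(strip_html_entities(tweet))

# Alternative: decode entities back to the characters they stand for.
print(html.unescape(tweet))
```

Whether entities should be deleted or decoded probably depends on the use case: deleting is consistent with how a cleaning step usually treats unwanted tokens, while decoding keeps characters such as `&` that may carry meaning in the tweet.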

I could open a PR, but I don't know how this would integrate with the other classes in the file (apart from Patterns).

Oct 22 '20 14:10 etiennekintzler
