preprocessor
Elegant and Easy Tweet Preprocessing in Python
Apologies, I cannot find the `Template on the Issues Page` so I will do my best to describe the feature. I am working on a project where I found this...
**Describe the bug** When parsing, the `reserved_words` attribute never gets filled; instead, a new, originally undefined attribute `reserved` is created and filled. **To Reproduce** I printed out the following in...
When parsing, the `reserved_words` attribute never gets filled; instead, a new, originally undefined attribute `reserved` is created and filled. This is my simple fix to resolve https://github.com/s/preprocessor/issues/55 This may create...
**Describe the bug** The tool automatically removes some characters, such as á, é, ú, í, in some languages. For example: accidente clichecístico -> accidente clichecstico Here is the code: ```...
Passing a non-English Unicode string to `preprocessor.clean` with the `preprocessor.OPT.EMOJI` option returns random, meaningless characters. This happens only on version 0.6.0. The cause of this issue seems to...
**Is your feature request related to a problem? Please describe.** There is no alternative in Preprocessor to replacing hashtags by a dummy $HASHTAG$ token. This has frustrated some users who...
**Describe the bug** I want to replace URLs in the text with a URL tag, it generally works well but with some input, my code seems to bug with no...
**Describe the bug** Running `pp.clean('http://google.com/..........................')` takes too much time. Seems like it's a bug. **To Reproduce** Run `pp.clean('http://google.com/..........................')` **Expected behavior** It can return: - `'..........................'` - `''` **Desktop (please...
Mentions in tweets often have a trailing colon (`@user:`). The current regex `MENTION_PATTERN = re.compile(r'@\w*')` does not match the colon. I changed it to `MENTION_PATTERN = re.compile(r'@\w*:?')` so that the trailing colon is matched as well.
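The difference between the two patterns can be seen in a small standalone sketch. It uses only the standard `re` module rather than the library itself; the sample tweet and the helper name `remove_mentions` are illustrative assumptions, not part of the project.

```python
import re

# The two patterns quoted in the pull request description.
OLD_MENTION_PATTERN = re.compile(r'@\w*')
NEW_MENTION_PATTERN = re.compile(r'@\w*:?')


def remove_mentions(text, pattern):
    """Hypothetical helper: delete every match of `pattern` from `text`."""
    return pattern.sub('', text)


tweet = "RT @user: thanks @friend for the link"

# The old pattern leaves the colon behind.
print(repr(remove_mentions(tweet, OLD_MENTION_PATTERN)))
# The new pattern removes the colon together with the mention.
print(repr(remove_mentions(tweet, NEW_MENTION_PATTERN)))
```

Because the colon is optional (`:?`), mentions without a colon, such as `@friend` above, are still matched exactly as before.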
The `encode('ascii', 'ignore').decode('ascii')` strategy does not work for non-English characters, because it drops every non-ASCII character and not just the emojis. Since emoji regex patterns already exist in `defines.py`, a regex substitution is sufficient to remove the emojis. Fixes #47 and #48
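A minimal sketch of the two strategies, for comparison. The `EMOJI_PATTERN` below is a deliberately narrow, hypothetical stand-in covering only a few Unicode blocks; it is not the pattern defined in `defines.py`, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical, simplified emoji pattern (NOT the one from defines.py):
# it covers a few common pictograph and symbol blocks only.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def strip_non_ascii(text):
    """ASCII round-trip: drops emojis, but also every accented letter."""
    return text.encode('ascii', 'ignore').decode('ascii')


def strip_emojis(text):
    """Regex substitution: removes only characters matched as emojis."""
    return EMOJI_PATTERN.sub('', text)


text = "accidente clichecístico 😀"

print(repr(strip_non_ascii(text)))  # the "í" is lost along with the emoji
print(repr(strip_emojis(text)))     # the "í" is preserved
```

The ASCII round-trip reproduces the `clichecístico -> clichecstico` symptom reported above, while the substitution leaves accented letters untouched.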