
Elegant and Easy Tweet Preprocessing in Python

Results 16 preprocessor issues

Apologies, I cannot find the `Template on the Issues Page` so I will do my best to describe the feature. I am working on a project where I found this...

**Describe the bug** When parsing, the `reserved_words` attribute never gets filled; a new, originally undefined attribute `reserved` is created and filled instead. **To Reproduce** I printed out the following in...

bug

When parsing, the `reserved_words` attribute never gets filled; a new, originally undefined attribute `reserved` is created and filled instead. This is my simple fix to resolve https://github.com/s/preprocessor/issues/55 This may create...

**Describe the bug** The tool automatically removes some characters, such as á é ú í,... in some languages. For example: accidente clichecístico -> accidente clichecstico Here is the code: ```...

bug

A non-English Unicode string passed to `preprocessor.clean` with the `preprocessor.OPT.EMOJI` option returns random, meaningless characters. This happens only on version 0.6.0. The cause of this issue seems to...

bug

**Is your feature request related to a problem? Please describe.** There is no alternative in Preprocessor to replacing hashtags by a dummy $HASHTAG$ token. This has frustrated some users who...

enhancement

**Describe the bug** I want to replace URLs in the text with a URL tag, it generally works well but with some input, my code seems to bug with no...

bug

**Describe the bug** Running `pp.clean('http://google.com/..........................')` takes too much time. Seems like it's a bug. **To Reproduce** run `pp.clean('http://google.com/..........................')` **Expected behavior** It can return: - `'..........................'` - `''` **Desktop (please...

bug

Mentions in tweets often have a trailing colon (@user:). The current regex `MENTION_PATTERN = re.compile(r'@\w*')` does not match the colon. I changed it to `MENTION_PATTERN = re.compile(r'@\w*:?')` so that the colon is matched as part of the mention.
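To illustrate the difference between the two patterns quoted above, here is a minimal sketch using only the standard `re` module. The names `OLD_MENTION_PATTERN` and `NEW_MENTION_PATTERN` and the sample tweet are made up for this example; how the library applies the pattern internally is not shown here.

```python
import re

# Pattern quoted above as the current one: stops before a trailing colon.
OLD_MENTION_PATTERN = re.compile(r'@\w*')
# Proposed pattern: also consumes one optional trailing colon.
NEW_MENTION_PATTERN = re.compile(r'@\w*:?')

# Hypothetical sample tweet, for illustration only.
text = "RT @user: hello @other"

# Removing mentions with the old pattern leaves a stray colon behind.
old_result = OLD_MENTION_PATTERN.sub('', text)
# The new pattern removes the colon together with the mention.
new_result = NEW_MENTION_PATTERN.sub('', text)

print(repr(old_result))  # 'RT : hello '
print(repr(new_result))  # 'RT  hello '
```

Because the colon is optional (`:?`), mentions without a colon, such as `@other`, are still matched as before.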

The `encode('ascii', 'ignore').decode('ascii')` strategy does not work for non-English characters. Since emoji regex patterns already exist in `defines.py`, regex substitution is sufficient to remove the emojis. Fixes #47 and #48
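The following sketch shows why the ASCII round trip damages accented text while a regex substitution does not. `EMOJI_PATTERN` here is a deliberately narrow, hypothetical pattern written for this example; it is an assumption and not the pattern defined in the library's `defines.py`. The helper names are likewise made up.

```python
import re

# Hypothetical, narrow emoji pattern for illustration only
# (NOT the pattern from the library's defines.py).
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F"   # emoticons
    "\U0001F300-\U0001F5FF"    # symbols and pictographs
    "\U0001F680-\U0001F6FF]+"  # transport and map symbols
)

def strip_emoji_ascii(text):
    # Drops every non-ASCII character, including accented letters.
    return text.encode('ascii', 'ignore').decode('ascii')

def strip_emoji_regex(text):
    # Removes only characters matched by the emoji pattern.
    return EMOJI_PATTERN.sub('', text)

sample = "accidente clichecístico \U0001F600"

print(repr(strip_emoji_ascii(sample)))  # 'accidente clichecstico '
print(repr(strip_emoji_regex(sample)))  # 'accidente clichecístico '
```

The ASCII approach reproduces the `clichecístico -> clichecstico` loss reported above, whereas the regex version leaves the `í` intact.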