preprocessor issues

Results: 16 preprocessor issues

feature: allow tokenization of files

Apologies, I cannot find the `Template on the Issues Page` so I will do my best to describe the feature. I am working on a project where I found this...

EllingtonKirby

`reserved_words` attribute never gets filled; an originally undefined attribute `reserved` is created and filled instead.

**Describe the bug** When parsing, the `reserved_words` attribute never gets filled; a new, originally undefined attr `reserved` is created and filled instead. **To Reproduce** I printed out the following in...

rsyarif

bug

Fix: fill `reserved_words` attribute instead of the originally undefined attribute `reserved`

When parsing, the `reserved_words` attribute never gets filled; a new, originally undefined attr `reserved` is created and filled instead. This is my simple fix to resolve https://github.com/s/preprocessor/issues/55. This may create...

rsyarif

Bug: removes some Latin characters

**Describe the bug** The tool automatically removes some characters, such as á, é, ú, í, ... in some languages. For example: accidente clichecístico -> accidente clichecstico Here is the code: ```...

hoangthangta

bug

Encoding issue with non-English text

A non-English Unicode string passed to preprocessor.clean with the preprocessor.OPT.EMOJI option returns random, meaningless characters. This happens only on version 0.6.0. The cause of this issue seems to...

omid-jf

bug

[Feature Proposal] Use hashformers for hashtag segmentation

**Is your feature request related to a problem? Please describe.** There is no alternative in Preprocessor to replacing hashtags by a dummy $HASHTAG$ token. This has frustrated some users who...

ruanchaves

enhancement

Bug when using Tokenize with the URL option

**Describe the bug** I want to replace URLs in the text with a URL tag, it generally works well but with some input, my code seems to bug with no...

lezakkaz

bug

Edge case that takes too much time

**Describe the bug** Running `pp.clean('http://google.com/..........................')` takes too much time. Seems like it's a bug. **To Reproduce** Run `pp.clean('http://google.com/..........................')` **Expected behavior** It can return: - `'..........................'` - `''` **Desktop (please...

kvtoraman

bug

Fix #2 (punctuations): allows for colons in matching mentions

Mentions in tweets often have a trailing colon (@user:). The current regex `MENTION_PATTERN = re.compile(r'@\w*')` does not match the colon. I changed it to `MENTION_PATTERN = re.compile(r'@\w*:?')` so that a trailing colon is matched as part of the mention.

wtsong
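
The effect of the pattern change described in the pull request above can be sketched with the standard `re` module. The two patterns are copied from the description; the sample tweet and the names `OLD_MENTION_PATTERN`, `NEW_MENTION_PATTERN`, `old_result`, and `new_result` are illustrative assumptions, and substituting an empty string is only a stand-in for whatever the library actually does with a matched mention.

```python
import re

# Patterns as quoted in the pull request description.
OLD_MENTION_PATTERN = re.compile(r'@\w*')
NEW_MENTION_PATTERN = re.compile(r'@\w*:?')

# Hypothetical tweet text, not taken from the issue tracker.
text = "RT @user: thanks @friend for the link"

# Removing mentions with the old pattern leaves the colon behind.
old_result = OLD_MENTION_PATTERN.sub('', text)

# The optional ':' in the new pattern consumes the colon with the mention.
new_result = NEW_MENTION_PATTERN.sub('', text)

print(repr(old_result))
print(repr(new_result))
```

With the old pattern a stray `:` stays in the text after `@user` is removed; with the new pattern the colon is removed together with the mention. Mentions without a colon, such as `@friend`, are matched the same way by both patterns.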

Fix support for non-English texts

The `encode('ascii', 'ignore').decode('ascii')` strategy does not work for non-English characters. Since emoji regex patterns already exist in defines.py, a regex substitution is sufficient to remove the emojis. Fixes #47 and #48.

omid-jf
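
A minimal sketch of why the ASCII round trip described in the pull request above loses non-English characters, compared with a regex substitution. The sample string reuses the example from the Latin-characters report on this page with an emoji appended. `EMOJI_PATTERN` is a hypothetical, deliberately narrow stand-in that covers only the Unicode "Emoticons" block; it is not the pattern defined in the library's defines.py.

```python
import re

# "accidente clichecístico" followed by a grinning-face emoji (U+1F600).
text = "accidente clichec\u00edstico \U0001F600"

# Strategy named in the pull request: drop everything that is not ASCII.
# This removes the emoji, but also the accented letter.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')

# Hypothetical emoji pattern (assumption): only the Emoticons block,
# U+1F600 to U+1F64F. A real pattern would cover more ranges.
EMOJI_PATTERN = re.compile('[\U0001F600-\U0001F64F]')

# Regex substitution removes the emoji and leaves other characters alone.
regex_cleaned = EMOJI_PATTERN.sub('', text)

print(repr(ascii_only))
print(repr(regex_cleaned))
```

The ASCII round trip cannot distinguish an emoji from an accented letter, since both fall outside ASCII, whereas a targeted pattern removes only the code points it lists.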
