
punctuation not being removed correctly using `preprocessing.clean`

Open aliforgetti opened this issue 3 years ago • 2 comments

This is my code. I was trying to clean a large dataset:

full_data['text_pp'] = (
    full_data['text']
    .pipe(hero.preprocessing.clean)
    .pipe(hero.remove_urls)
)

According to the documentation this is the default pipeline for the clean functionality:

Default pipeline:

1. texthero.preprocessing.fillna()
2. texthero.preprocessing.lowercase()
3. texthero.preprocessing.remove_digits()
4. texthero.preprocessing.remove_punctuation()
5. texthero.preprocessing.remove_diacritics()
6. texthero.preprocessing.remove_stopwords()
7. texthero.preprocessing.remove_whitespace()

But my output does not reflect this, as some of the punctuation remained in the text.

Original text column image

Preprocessed text column image

aliforgetti, Apr 13 '21 20:04

Hi, could you paste the actual data you're using? (Just one of the texts would probably help.)

For me with the beginning of your first text, the punctuation is removed successfully:

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Honestly people don't know about the fact ..."])
>>> hero.clean(s)
0    honestly people know fact
dtype: object

The issue is probably that some punctuation in your text is not "standard" punctuation. texthero internally uses `string.punctuation` (from `import string`), so any character that is not in that set won't be removed.
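To illustrate the point, here is a minimal standard-library sketch, assuming the leftover characters are typographic (Unicode) punctuation such as curly quotes or an en dash. The sample string and the helper `remove_unicode_punctuation` are hypothetical and not part of texthero; replacing matches with a space is just a choice made for this sketch.

```python
import string
import unicodedata

# Hypothetical sample with typographic punctuation:
# curly double quotes, a curly apostrophe, an en dash, and an ellipsis character.
sample = "Honestly \u201cpeople\u201d don\u2019t know \u2013 about the fact\u2026"

# Characters that Unicode classifies as punctuation (category P*)
# but that are absent from the ASCII-only string.punctuation.
missed = sorted(
    {
        ch
        for ch in sample
        if unicodedata.category(ch).startswith("P") and ch not in string.punctuation
    }
)


def remove_unicode_punctuation(text: str) -> str:
    """Replace every character in a Unicode punctuation category with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


# Collapse the runs of spaces left behind by the replacement.
cleaned = " ".join(remove_unicode_punctuation(sample).split())

print(missed)
print(cleaned)
```

If `missed` is non-empty for one of your texts, that would explain the leftover punctuation. A helper like this could be applied to a column with pandas' `Series.apply` before or after `hero.clean`. Note that it only covers the Unicode punctuation categories; symbols such as `$` or `+` are in the symbol categories and are still handled by `string.punctuation`.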

henrifroese, Apr 14 '21 10:04

Thank you @henrifroese. @aliforgetti do you have any updates?

jbesomi, Apr 16 '21 11:04