Inconsistent exception is raised when a Series containing NaNs is passed to `nlpretext.basic.preprocess.remove_stopwords`

Open julesbertrand opened this issue 3 years ago • 1 comments

🐛 Bug Report

When using the remove_stopwords function, if your text column has empty values, nlpretext raises inconsistent exceptions (about the language choice).

🔬 How To Reproduce

Steps to reproduce the behavior:

Load the data, convert it to a DataFrame, and concatenate the two text columns without a space between them. Some rows will be empty.
Call remove_stopwords on the concatenated column.

Code sample

import pandas as pd
from nlpretext.basic.preprocess import remove_stopwords

data = {'overview': {
  0: 'Comme les Mousquetaires dont elles possèdent le cran',
  1: 'New York, été 1977. Alors que la ville connait une canicule historique, un tueur en série, The Son of Sam, frappe dans le quartier italo-américain de South Bronx.',
  2: '',
  3: "Félicia, dix-sept ans, traverse la mer d'Irlande, avec pour tout renseignement le nom de la ville où habite son amant pour lui annoncer sa grossesse.",
  4: "Arthur Bishop pensait qu'il avait mis son passé de tueur à gages derrière lui. Il coule maintenant des jours heureux avec sa compagne dans l'anonymat."},
 'tagline': {0: '', 1: '', 2: '', 3: '', 4: 'Il reprend du service.'}
}

data = pd.DataFrame(data)

data["text"] = data["tagline"] + data["overview"]

data["text"].map(lambda x: remove_stopwords(x, lang='fr'))

Environment

OS: google colab
Python version: 3.7

Screenshots

First exception: [screenshot, 2022-03-22 at 15:52:41]

Then, when replacing 'fr' with 'fr_scpacy':

📈 Expected behavior

Remove the stopwords without errors (convert NaNs to strings?), or raise an exception saying "your text column contains NaNs, please fix it".
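As a rough illustration of the two options above, here is a minimal sketch in plain Python. The helper names `is_missing`, `nan_to_empty` and `check_text_values` are hypothetical and not part of nlpretext; the sketch only assumes that a missing text value shows up as `None` or a float NaN.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (assumed representation of a missing text)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nan_to_empty(value):
    """Option 1 (hypothetical helper): turn a missing value into an empty string."""
    return "" if is_missing(value) else value


def check_text_values(values):
    """Option 2 (hypothetical helper): raise one explicit error for missing values."""
    bad_positions = [i for i, v in enumerate(values) if is_missing(v)]
    if bad_positions:
        raise ValueError(
            f"your text column contains NaNs at positions {bad_positions}, please fix it"
        )
```

Either helper could be applied to the column before stopword removal, for example `check_text_values(data["text"])` to fail early with a readable message.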

📎 Additional context

Workaround: data["text"] = data["tagline"] + " " + data["overview"] solves it, as all rows will then be non-empty strings.
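An alternative to padding with a space might be to skip blank rows entirely. The sketch below assumes the failure is triggered by empty or missing text, as the report suggests; `apply_to_non_empty` is a hypothetical helper, not an nlpretext function, and it takes the cleaning function as an argument so it can be tried without nlpretext installed.

```python
def apply_to_non_empty(text, func):
    """Call func(text) only for non-blank strings.

    Anything else (empty string, whitespace only, None, NaN) is returned as "",
    so func never sees the values suspected of causing the exception.
    """
    if not isinstance(text, str) or not text.strip():
        return ""
    return func(text)


# Intended use with the code sample from this report (assumes nlpretext is installed):
# data["text"].map(
#     lambda x: apply_to_non_empty(x, lambda t: remove_stopwords(t, lang="fr"))
# )
```

This keeps the original text untouched for non-empty rows, whereas the space-padding workaround adds a leading space to every row whose tagline is empty.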

Mar 22 '22 14:03 julesbertrand

Hello @julesbertrand, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

Mar 22 '22 14:03 github-actions[bot]