Inconsistent exception is raised when a Series containing NaNs is passed to `nlpretext.basic.preprocess.remove_stopwords`
🐛 Bug Report
When using the `remove_stopwords` function on a text column that has empty values, nlpretext raises inconsistent exceptions (about the language choice).
🔬 How To Reproduce
Steps to reproduce the behavior:
- Load the data, convert it to a DataFrame, and concatenate the two text columns without a space between them. Some rows will be empty.
- Try using `remove_stopwords`.
Code sample
import pandas as pd
from nlpretext.basic.preprocess import remove_stopwords
data = {
    'overview': {
        0: 'Comme les Mousquetaires dont elles possèdent le cran',
        1: 'New York, été 1977. Alors que la ville connait une canicule historique, un tueur en série, The Son of Sam, frappe dans le quartier italo-américain de South Bronx.',
        2: '',
        3: "Félicia, dix-sept ans, traverse la mer d'Irlande, avec pour tout renseignement le nom de la ville où habite son amant pour lui annoncer sa grossesse.",
        4: "Arthur Bishop pensait qu'il avait mis son passé de tueur à gages derrière lui. Il coule maintenant des jours heureux avec sa compagne dans l'anonymat.",
    },
    'tagline': {0: '', 1: '', 2: '', 3: '', 4: 'Il reprend du service.'},
}
data = pd.DataFrame(data)
data["text"] = data["tagline"] + data["overview"]
data["text"].map(lambda x: remove_stopwords(x, lang='fr'))
Environment
- OS: google colab
- Python version: 3.7
Screenshots
First exception:
Then, when replacing 'fr' with 'fr_scpacy':

📈 Expected behavior
Remove the stopwords without errors (convert NaNs to strings?), or raise an exception saying "your text column contains NaNs, please fix it".
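The second option (a clear, explicit error) could look roughly like the sketch below. This is not nlpretext code: `find_invalid_text_positions` and `check_text_values` are hypothetical helper names, and treating `None`, NaN, non-string values, and blank strings all as "invalid" is an assumption about what should count as empty.

```python
from typing import Any, Iterable, List


def find_invalid_text_positions(values: Iterable[Any]) -> List[int]:
    """Return the positions of values that are not non-blank strings.

    None and float NaN are not strings, so they are caught by the
    isinstance check; empty or whitespace-only strings are caught by strip().
    """
    return [
        position
        for position, value in enumerate(values)
        if not isinstance(value, str) or not value.strip()
    ]


def check_text_values(values: Iterable[Any]) -> None:
    """Raise a ValueError that names the offending positions, if any."""
    bad = find_invalid_text_positions(values)
    if bad:
        raise ValueError(
            f"Text column contains empty or NaN values at positions {bad}; "
            "please fill or drop them before preprocessing."
        )
```

Calling `check_text_values(data["text"])` before the `map` call would then fail early with a message about the data rather than about the language.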
📎 Additional context
Workaround: `data["text"] = data["tagline"] + " " + data["overview"]` solves it, as all rows will then be non-empty strings.
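An alternative that leaves the text unchanged is to skip empty values before calling the function. The sketch below is only an illustration under assumptions: `apply_if_text` is a hypothetical helper, and `fake_remove_stopwords` is a stand-in with a tiny hard-coded stopword set so the example runs without nlpretext installed.

```python
from typing import Any, Callable


def apply_if_text(func: Callable[[str], Any], value: Any, default: Any = "") -> Any:
    """Call func(value) only for non-blank strings; otherwise return default."""
    if isinstance(value, str) and value.strip():
        return func(value)
    return default


def fake_remove_stopwords(text: str) -> str:
    # Stand-in for the real function (assumption: whitespace tokenization
    # and a five-word stopword set, purely for demonstration).
    stopwords = {"le", "la", "les", "de", "du"}
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


texts = ["Il reprend du service.", "", float("nan")]
cleaned = [apply_if_text(fake_remove_stopwords, t) for t in texts]
```

With the DataFrame from the code sample, the same guard would presumably be used as `data["text"].map(lambda x: apply_if_text(lambda t: remove_stopwords(t, lang='fr'), x))`.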
Hello @julesbertrand, thank you for your interest in our work!
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.