
Wrong German stopwords in the stopwords corpus

Open juh2 opened this issue 10 years ago • 3 comments

nltk_data/packages/corpora/stopwords.zip contains four wrong German stopwords:

unse
unsem
unsen
unses

juh2 avatar Jan 16 '15 08:01 juh2

The "non-words" raised by @juh2 should have been resolved in #49

>>> from nltk.corpus import stopwords
>>> deu_stops = stopwords.words('german')
>>> 'unse' in deu_stops
False
>>> 'unsem' in deu_stops
False
>>> 'unsen' in deu_stops
False
>>> 'unses' in deu_stops
False
>>> 'unsere' in deu_stops  # a valid stopword
True

But other German stopwords are still missing, to list a few:

>>> 'unserige' in deu_stops
False
>>> 'unserins' in deu_stops
False
>>> 'unseriner' in deu_stops
False
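
The same check can be run over several candidates at once instead of one `in` test per word. This is only a minimal sketch: `missing_stopwords` is a hypothetical helper, not part of NLTK, and `sample_stops` is a small stand-in list. With NLTK and the corpus installed, you would pass `stopwords.words('german')` instead.

```python
def missing_stopwords(candidates, stopword_list):
    """Return the candidates that are absent from stopword_list, in input order."""
    known = {w.lower() for w in stopword_list}
    return [w for w in candidates if w.lower() not in known]


# Stand-in for stopwords.words('german'); an assumption for illustration only.
sample_stops = ["uns", "unser", "unsere"]

print(missing_stopwords(["unsere", "unserige"], sample_stops))
```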

alvations avatar May 09 '17 14:05 alvations

"unserins" and "unseriner" are not German words. Do you mean "unsereins" and "unsereiner"?

hebecked avatar Jan 11 '21 08:01 hebecked

Please propose a definitive list of German stopwords and I will update our list.
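
One possible way to prepare such a proposal is sketched below. It assumes the corpus file is plain text with one lowercase word per line, and `format_stopword_file` is a hypothetical helper rather than an NLTK function. The words passed in are just the ones mentioned in this thread, not a definitive list.

```python
def format_stopword_file(words):
    """Lowercase, strip and deduplicate words; return one word per line."""
    cleaned = {w.strip().lower() for w in words if w.strip()}
    # Plain code-point sort; good enough for a reviewable diff.
    return "\n".join(sorted(cleaned)) + "\n"


# Example input only: duplicates and stray whitespace are normalised away.
proposal = format_stopword_file(["unsere", "Unsereins", "unsereiner", "unsere "])
print(proposal, end="")
```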

stevenbird avatar Jul 04 '22 05:07 stevenbird