Duplicates in words.words() dictionary

Open tgustavson opened this issue 8 years ago • 2 comments

The words.words() dictionary contains 844 duplicates, which may as well be eliminated. I noticed them because some of the duplicates were out of alphabetical order.

Here is some Python to illustrate this:

    >>> from nltk.corpus import words
    >>> len(words.words())
    236736
    >>> len(set(words.words()))
    235892
    >>> 236736 - 235892
    844

Thanks.

tgustavson avatar Dec 02 '16 00:12 tgustavson

The words corpus actually contains two files: en with 235886 unique words and en-basic with 850 unique words. Out of these 850 words, 6 are not found in en:

["near", "behaviour", "harbour", "humour", "box", "colour"]

This is why there are 850 - 6 = 844 duplicate words.
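
A minimal sketch of how this breakdown could be checked. The helper names `overlap_report` and `nltk_overlap_report` are made up for illustration; the second one assumes NLTK is installed and the words corpus has been downloaded, and relies on `words.words()` accepting a file ID such as "en" or "en-basic".

```python
def overlap_report(main_words, basic_words):
    """Summarise how two word lists overlap when concatenated."""
    main_set, basic_set = set(main_words), set(basic_words)
    combined = list(main_words) + list(basic_words)
    unique = set(combined)
    return {
        "total": len(combined),                    # entries across both files
        "unique": len(unique),                     # distinct entries
        "duplicates": len(combined) - len(unique), # repeated entries
        "only_in_basic": sorted(basic_set - main_set),
    }


def nltk_overlap_report():
    """Run the report on the NLTK words corpus (needs nltk.download('words'))."""
    from nltk.corpus import words  # imported lazily so the helper above has no dependency
    return overlap_report(words.words("en"), words.words("en-basic"))
```

Calling `nltk_overlap_report()` on a machine with the corpus installed should show whether the counts above still hold for the installed version of the data.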

The corpus is apparently based on the words dict available on Unix machines, but since there doesn't seem to be a "canonical" dict (/usr/share/dict/words only has 99171 words on my machine), I suggest we delete en-basic and put the 6 words in en instead.

simonrichard avatar Dec 06 '16 17:12 simonrichard

Is there a reason why words.words() is a list, and not, say, a frozenset?
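
For what it's worth, a caller who only needs membership tests can build the frozenset once on their side. A small sketch, where `build_vocabulary` is a hypothetical helper and the short list stands in for the output of words.words():

```python
def build_vocabulary(word_list):
    """Collapse a word list into an immutable set for fast membership tests."""
    return frozenset(word_list)


# Stand-in data; in practice this would be words.words().
vocab = build_vocabulary(["apple", "banana", "apple"])

print("apple" in vocab)   # membership test against the set
print(len(vocab))         # duplicates are collapsed
```

This drops the ordering and the duplicates, so it only fits code that does not depend on either.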

nishkalavallabhi avatar Sep 14 '18 21:09 nishkalavallabhi