nltk_data
Duplicates in words.words() dictionary
The words.words() dictionary contains 844 duplicates, which might as well be eliminated. I discovered this because some of the entries were out of alphabetical order.
Here is some Python to illustrate this:
>>> from nltk.corpus import words
>>> len(words.words())
236736
>>> len(set(words.words()))
235892
>>> 236736 - 235892
844
Thanks.
The words corpus actually contains two files: en, with 235886 unique words, and en-basic, with 850 unique words. Of these 850 words, 6 are not found in en:

["near", "behaviour", "harbour", "humour", "box", "colour"]

This is why there are 850 - 6 = 844 duplicate words.
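The arithmetic can be sketched with plain set operations. Here is a minimal, self-contained illustration using toy stand-ins for the two corpus files (not the real contents; the real check would load the en and en-basic fileids from the corpus):

```python
# Toy stand-ins for the two corpus files (hypothetical contents).
en = {"color", "humor", "near", "box", "cat", "dog"}
en_basic = {"colour", "humour", "near", "box", "cat"}

missing = en_basic - en     # en-basic words absent from en
duplicates = en_basic & en  # words listed in both files

print(sorted(missing))      # ['colour', 'humour']
# Duplicate count = size of en-basic minus its unique words.
print(len(en_basic) - len(missing) == len(duplicates))  # True
```

In the real corpus the same identity gives 850 - 6 = 844 duplicates.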
The corpus is apparently based on the words dict available on Unix machines, but since there doesn't seem to be a "canonical" dict (/usr/share/dict/words only has 99171 words on my machine), I suggest we delete en-basic and put the 6 missing words in en instead.
Is there a reason why words.words() is a list, and not, say, a frozenset?
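Membership tests are where a frozenset would pay off over a list. A minimal sketch with a stand-in word list (not the real corpus, which requires downloaded nltk data):

```python
# Stand-in for words.words(): a list that contains duplicates.
word_list = ["apple", "banana", "apple", "cherry"]

# frozenset deduplicates and makes lookups O(1) instead of O(n);
# being immutable, it also can't be accidentally mutated by callers.
word_set = frozenset(word_list)

print(len(word_list), len(word_set))  # 4 3
print("banana" in word_set)           # True
```

The trade-off is that a list preserves the file's ordering (and is what the corpus reader API returns for other corpora), so callers relying on order or indexing would break.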