nltk_data
Words missing from the words corpus
'children' is missing from the English words corpus. I understand that some plurals are left out, but other irregular plurals such as 'hippopotami' and 'corpora' are included. Deriving 'children' from 'child' programmatically would be difficult.
'burger' is also missing, even though 'cheeseburger' and 'hamburger' are present.
See https://stackoverflow.com/questions/44449284/nltk-words-corpus-does-not-contain-okay
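The gaps above can be reproduced with a quick membership check. This is a minimal sketch, assuming NLTK is installed and the corpus has been fetched with `nltk.download("words")`; the helper names `load_nltk_words` and `missing_words` are made up for illustration and are not part of NLTK.

```python
def load_nltk_words():
    """Return the NLTK words corpus as a lowercase set.

    Imported lazily so the helper below can be used without NLTK installed.
    """
    from nltk.corpus import words  # requires nltk.download("words")
    return {w.lower() for w in words.words()}


def missing_words(candidates, vocabulary):
    """Return the candidates (in order) that are not in the vocabulary.

    Comparison is case-insensitive; this is an assumption of the sketch.
    """
    vocab = {w.lower() for w in vocabulary}
    return [w for w in candidates if w.lower() not in vocab]


# Example usage (needs the corpus, so it is left commented out):
# vocab = load_nltk_words()
# print(missing_words(["child", "children", "burger", "hamburger"], vocab))
```

Lowercasing both sides avoids false misses caused by capitalization alone.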
Modern Unix systems seem to ship much larger, or at least more comprehensive, word lists. I did a rough "diff" of Ubuntu 18.04's /usr/share/dict/american-english against the words corpus. Not counting proper nouns or possessives, the system list has over 31000 additional words. Some of these are trivial yet important, for example: 'failed', 'fails', 'failures', 'succeeds', 'succeeded'
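A comparison like this can be sketched roughly as follows. This is not the exact procedure used for the count above. The filters are assumptions: entries ending in `'s` are treated as possessives and entries starting with an uppercase letter as proper nouns. The names `is_plain_word`, `extra_words`, and `read_word_list` are hypothetical.

```python
def is_plain_word(word):
    """Heuristic filter: drop possessives and capitalized entries
    (assumed here to be proper nouns)."""
    return bool(word) and not word.endswith("'s") and not word[0].isupper()


def extra_words(system_words, corpus_words):
    """Return sorted words found in system_words but not in corpus_words.

    The corpus side is lowercased so that 'Dog' in the corpus covers 'dog'.
    """
    corpus = {w.lower() for w in corpus_words}
    return sorted(
        {w for w in system_words if is_plain_word(w) and w.lower() not in corpus}
    )


def read_word_list(path):
    """Read a one-word-per-line file such as /usr/share/dict/american-english."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Example usage (needs the system word list and the NLTK corpus):
# from nltk.corpus import words
# system = read_word_list("/usr/share/dict/american-english")
# print(len(extra_words(system, words.words())))
```

The exact count depends on how strictly proper nouns and possessives are filtered, so a run of this sketch may not match the figure quoted above.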
I'm not convinced that "it's a fixed list forever" is a real solution. If the only obstacle is that updating the list is a lot of bother, I've made it simple by attaching a [zip file](url en.comprehensive.zip ) with the original and new words combined.