nltk_data

Word missing in words

Open lukeellison opened this issue 8 years ago • 3 comments

'children' is missing from the English words corpus. I understand that some plurals are missing, but some irregular plurals, like 'hippopotami' and 'corpora', are included. I would find it difficult to derive 'children' from 'child' programmatically.

lukeellison • Jan 30 '17 12:01

'burger' is also missing, even though there is 'cheeseburger' and 'hamburger'
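A quick way to reproduce reports like these is a plain membership check against the corpus. The following is a minimal sketch: `missing_words` and `load_nltk_words` are hypothetical helper names, and the loader assumes NLTK is installed and the corpus has been fetched with `nltk.download('words')`.

```python
def missing_words(candidates, vocabulary):
    """Return the candidates that are absent from the vocabulary, in input order."""
    vocab = set(vocabulary)  # set lookup keeps each membership test cheap
    return [word for word in candidates if word not in vocab]


def load_nltk_words():
    """Load the NLTK `words` corpus as a list of strings.

    Assumes the corpus was downloaded beforehand with nltk.download('words').
    The import is done lazily so missing_words() works without NLTK.
    """
    from nltk.corpus import words
    return words.words()
```

Calling `missing_words(['child', 'children', 'burger', 'hamburger'], load_nltk_words())` should list whichever of those entries the installed corpus lacks.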

lukeellison • Jan 30 '17 15:01

See https://stackoverflow.com/questions/44449284/nltk-words-corpus-does-not-contain-okay

alvations • Jul 24 '17 09:07

Modern Unix systems seem to ship much larger, or at least more comprehensive, word lists. I did a sort of "diff" of Ubuntu 18.04's /usr/share/dict/american-english against what's in the words corpus and, not counting proper nouns or possessives, there are over 31,000 additional words. Some of these are trivial but important, for example: 'failed', 'fails', 'failures', 'succeeds', 'succeeded'.
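A comparison of this kind can be sketched as a filtered set difference. The exact filtering behind the count above isn't stated, so the rules here are assumptions: entries starting with an uppercase letter are treated as proper nouns, and entries ending in `'s` as possessives. `extra_words` and `read_word_list` are hypothetical helper names.

```python
def extra_words(system_words, corpus_words):
    """Return sorted words found in system_words but not in corpus_words.

    Assumed heuristics: capitalized entries are skipped as proper nouns,
    and entries ending in "'s" are skipped as possessives.
    """
    known = set(corpus_words)
    extras = set()
    for word in system_words:
        word = word.strip()  # drop the trailing newline from file lines
        if not word or word[0].isupper() or word.endswith("'s"):
            continue
        if word not in known:
            extras.add(word)
    return sorted(extras)


def read_word_list(path):
    """Read a one-word-per-line dictionary file into a list of strings."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

With NLTK installed and the corpus downloaded, something like `extra_words(read_word_list('/usr/share/dict/american-english'), nltk.corpus.words.words())` would produce the list of additions; the resulting count depends on the dictionary version and on the heuristics chosen.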

I'm not convinced that just saying "it's a fixed list forever" is really a solution. If the issue is simply that this is a lot of bother, I've made it simple by attaching a [zip file](url en.comprehensive.zip ) with the original and new words combined.

GregIthaca • Jan 14 '22 17:01