words.txt lacks words that are in words_alpha.txt

Open carlosaguilarmelchor opened this issue 4 years ago • 2 comments

Example:

# cat words_alpha.txt|grep ^ned                                        
ned
nedder
neddy
neddies
nederlands
# cat words.txt|grep ^ned
nedder
neddies
#

The documentation states that words_alpha.txt is a subset of words.txt, which apparently is not the case as of now.

carlosaguilarmelchor, Jan 13 '21 09:01

I think this is just a case sensitivity issue.

$ cat words.txt|grep -i ^ned
NED
Neda
NEDC
Nedda
nedder
Neddy
Neddie
neddies
Neddra
Nederland
Nederlands
Nedi
Nedra
Nedrah
Nedry
Nedrow
Nedrud

While it would be nice for these files to be perfectly formatted, this is a good reminder to clean your data before doing calculations.
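A minimal sketch of that cleaning step, assuming both lists are available as iterables of lines (for example, open file handles). The helper names `normalize` and `missing_from` are hypothetical, and the sample data is copied from the grep output in this thread rather than read from the real files:

```python
def normalize(lines):
    """Return the set of lower-cased, whitespace-stripped, non-empty entries."""
    return {line.strip().lower() for line in lines if line.strip()}


def missing_from(reference_lines, candidate_lines):
    """Entries of candidate_lines absent from reference_lines, ignoring case."""
    return sorted(normalize(candidate_lines) - normalize(reference_lines))


# Sample entries taken from the grep output quoted in this thread.
words_sample = ["NED", "Neda", "nedder", "Neddy", "neddies", "Nederlands"]
alpha_sample = ["ned", "nedder", "neddy", "neddies", "nederlands"]

print(missing_from(words_sample, alpha_sample))  # prints []
```

With the real files, the same call would take `open("words.txt")` and `open("words_alpha.txt")` as arguments.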

adsteel, Feb 06 '22 16:02

This problem does exist, however. I found 25 missing words with these Python 3 commands (pasted here for reference):

> import requests
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/master/words.txt')
> r.status_code
200
> w = set(r.text.lower().split())
> len(w)
466546
> r = requests.get('https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt')
> r.status_code
200
> wa = set(r.text.lower().split())
> len(wa)
370103
> missing = wa - w
> len(missing)
25
> missing
{'preinferredpreinferring', 'stegnosisstegnotic', 'tangantangan', 'false', 'sturdiersturdies', 'peroxidicperoxiding', 'gynecicgynecidal', 'coevolvedcoevolves', 'preobtrudingpreobtrusion', 'kestrelkestrels', 'aliyahaliyahs', 'coracoprocoracoid', 'cylindrocylindric', 'killeekillee', 'antinganting', 'epigonousepigons', 'snailfishessnailflower', 'outwardsoutwarred', 'regeneratoryregeneratress', 'cryptocurrency', 'quadriquadric', 'subsultorysubsultus', 'brigantinebrigantines', 'caducecaducean', 'hypophypophysism'}

Note that there is a separate problem here: several of these entries look like two words that were somehow merged together (e.g. "kestrelkestrels"). Even so, it remains true that not all words in words_alpha.txt are in words.txt (e.g. "false").
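One rough way to tell merged entries apart from genuinely missing words is to check whether an entry splits into two halves that are both known words. The sketch below assumes this; `split_merged` is a hypothetical helper, and the small `vocabulary` set is hand-picked for illustration rather than loaded from words.txt:

```python
def split_merged(entry, vocabulary):
    """Return every (left, right) split of entry where both halves are in vocabulary."""
    return [
        (entry[:i], entry[i:])
        for i in range(1, len(entry))
        if entry[:i] in vocabulary and entry[i:] in vocabulary
    ]


# Illustrative vocabulary only; in practice this would be the lower-cased words.txt set.
vocabulary = {"kestrel", "kestrels", "aliyah", "aliyahs", "false"}

for entry in ["kestrelkestrels", "aliyahaliyahs", "false"]:
    # An empty list means the entry is not a concatenation of two known words.
    print(entry, split_merged(entry, vocabulary))
```

Entries that come back with an empty list, such as "false" here, are the ones actually absent from the reference list rather than concatenation artifacts.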

JaviSorribes, Mar 10 '22 02:03