How many words should be in the word lists?
Trying to find a suitable word list for Swedish made me realize how difficult it is to select one, and how much the choice matters for the game to be fun. There are all sorts of lists, from small ones of 200 or 1000 common words to lists of 50k or 500k. There are also automated frequency lists built from subtitles, such as the OpenSubtitle FrequencyWords project, which covers a great many languages. It would be awesome if those could simply be imported and used as a base for supporting tons of languages, but there is a lot of weird stuff in these lists.
What is a reasonable limit? 500k is likely too many. SAOL, the definitive Swedish authority on spelling, has about 125 000 words, but has unclear licensing terms. Currently the GB dict has 50 000 words, the US one 70 000 words, while Polish has 428 000. Keeping the number of words reasonably similar is likely a good idea, as it provides a similar play experience across languages, with similar needs for options and defaults.
There are so many word games that it should be possible to reuse something that already exists, in many languages. But after searching for a while I cannot find anything. What are the commercial word games using?
This seems like a quite good list of 1000 most common words in English: https://simple.wiktionary.org/wiki/Wiktionary:Most_frequent_1000_words_in_English
There are many other similar lists on Wiktionary, both for English and many other languages, some based on frequency lists from subtitles. For a small set this might be good, as it seems to avoid the weird stuff that is more infrequent.
Aspell lists are pretty old for some languages. For Swedish the last update is from 2004, so not ideal.
For reference, here is an overview of dictionaries: https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words This is not the same as word list length though, as word lists need more forms.
#106
Maybe it is wrong to think about word lists only in terms of strict length. Grouping words in separate lists, and allowing them to be used or not separately, could open up a lot of interesting new possibilities.
Perhaps maintaining a set of word lists would be really useful as a separate repo, that can be re-used for many word games and more people could help to build and maintain it. Letter frequency etc per language would likely be useful for many types of games.
Here are some categories that would be interesting to have:

- Beginner words (200)
- Common words (1000)
- Expanded common words (10 000)
- City, country, continent, area names
- Flora names
- Fauna names
- Celebrity names
- Hard/unusual words
- Abbreviations
- Massive list of everything
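To make the idea of separately selectable lists a bit more concrete, here is a minimal Python sketch. The category names and the tiny word sets are made up for illustration; a shared repo would presumably load each category from its own file. It also shows how a per-language letter frequency could be derived from whatever selection is active.

```python
from collections import Counter

# Hypothetical category lists; real ones would be loaded from files.
CATEGORIES = {
    "beginner": {"cat", "dog", "sun"},
    "fauna": {"cat", "dog", "lynx"},
    "abbreviations": {"etc"},
}


def build_word_list(enabled):
    """Union the words of every enabled category."""
    words = set()
    for name in enabled:
        words |= CATEGORIES[name]
    return words


def letter_frequencies(words):
    """Share of each letter among all letters in the given words."""
    counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


selected = build_word_list(["beginner", "fauna"])
freqs = letter_frequencies(selected)
```

Because the categories are plain sets, words that appear in several categories are only counted once in the combined list.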
This seems like the best existing collection of dictionaries, as identified in other issues: https://github.com/tube42/wordlists
Discussion there: https://github.com/tube42/wordlists/issues/2
The most comprehensive word list project seems to be http://wordlist.aspell.net/, with the 3of6game list (mentioned in #58) being the one adapted for situations like Lexica: http://wordlist.aspell.net/12dicts-readme/#3of6game. It is only for English, but the choices made there might help us decide what would be suitable goals here for other languages as well.
The 3of6game list contains 65k words, in good formatting for use here. It includes variations on words, for example:
abort aborted aborting abortion abortionist abortionists abortions abortive aborts
For most common users, this is likely the
The source code for these wordlists is at https://github.com/en-wl/wordlist
The word samples on http://wordlist.aspell.net/scowl-readme/ might help with discussion as well. Level 95 includes really archaic words (total 210k words), but for a full Scrabble mode they should be included:
adlumidine alinasal aramina begreen bembixes boundly cannibalean
For most users, a level around 60 is probably suitable, so that the words are at least somewhat regularly used:
absurdists botanic cascaras charterer chestier commonsense equitation
The categorizations used there are:
Except for the special word lists the files follow the following
naming convention:
<spelling category>-<sub-category>.<size>
Where the spelling category is one of
english, american, british, british_z, canadian, australian
variant_1, variant_2, variant_3,
british_variant_1, british_variant_2,
canadian_variant_1, canadian_variant_2,
australian_variant_1, australian_variant_2
Sub-category is one of
abbreviations, contractions, proper-names, upper, words
And size is one of
10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
80 (huge), 95 (insane)
The special word lists follow are in the following format:
special-<description>.<size>
Where description is one of:
roman-numerals, hacker
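A small Python sketch of how the naming convention quoted above could be used to pick files. The regular expression and the function names are my own, not part of SCOWL, and the file names in the example are only illustrative. It assumes that a list for a given level is built by combining every file up to that size, which is how I read the readme.

```python
import re

# <spelling category>-<sub-category>.<size>, also matches special-<description>.<size>
SCOWL_NAME = re.compile(r"^(?P<category>[a-z_0-9]+)-(?P<sub>[a-z-]+)\.(?P<size>\d+)$")


def parse_scowl_name(filename):
    """Return (category, sub_category, size), or None if the name does not fit."""
    m = SCOWL_NAME.match(filename)
    if m is None:
        return None
    return m.group("category"), m.group("sub"), int(m.group("size"))


def select(filenames, categories, sub, max_size):
    """Keep files in the wanted categories and sub-category, up to max_size."""
    chosen = []
    for name in filenames:
        parsed = parse_scowl_name(name)
        if parsed is None:
            continue
        category, sub_category, size = parsed
        if category in categories and sub_category == sub and size <= max_size:
            chosen.append(name)
    return chosen


names = [
    "english-words.10",
    "english-words.70",
    "american-words.35",
    "british-words.35",
    "special-hacker.50",
]
level_60_us = select(names, {"english", "american"}, "words", 60)
```

Filtering on the sub-category this way leaves out abbreviations, contractions and proper names, which seems like a sensible default for a word game.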
There seems to be huge variation internationally in number of words accepted for scrabble, according to a post here: https://boardgames.stackexchange.com/questions/25243/how-many-allowed-scrabble-words-are-there-in-different-languages
- Dutch: 652k
- English: 279k
- French: 393k
- German: 180k
I did some comparisons between the current en_US and en_GB lists used in the game, and the 3of6game list. For example, for words of length 3, there is a noticeable difference:

- en_US: 930
- en_GB: 821
- 3of6game: 639
Comparing the words starting with z, of length 3:
| Lexica-US | Lexica-GB | Game3of6 |
| --- | --- | --- |
| zap | zap | zap |
| zea | | |
| zed | zed | |
| | zen | |
| zee | | |
| zel | | |
| zig | | |
| zip | zip | zip |
| zit | zit | zit |
| zoa | | |
| zoo | zoo | zoo |
| zzz | | |
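For anyone who wants to repeat this kind of comparison for other lengths or letters, here is a rough Python sketch. The function names are my own, and the sample data is only the 3-letter z-words from the table; the full lists would be read from the dictionary files instead.

```python
def words_matching(words, length, first_letter=None):
    """Sorted unique words of the given length, optionally filtered by first letter."""
    return sorted(
        w for w in set(words)
        if len(w) == length and (first_letter is None or w.startswith(first_letter))
    )


def comparison_rows(lists, length, first_letter):
    """One row per word, one column per list; empty string where a list lacks the word."""
    filtered = {
        name: set(words_matching(ws, length, first_letter))
        for name, ws in lists.items()
    }
    all_words = sorted(set().union(*filtered.values()))
    return [[w if w in filtered[name] else "" for name in lists] for w in all_words]


# Only the 3-letter z-words from the comparison; not the full dictionaries.
lists = {
    "Lexica-US": "zap zea zed zee zel zig zip zit zoa zoo zzz".split(),
    "Lexica-GB": "zap zed zen zip zit zoo".split(),
    "Game3of6": "zap zip zit zoo".split(),
}
rows = comparison_rows(lists, length=3, first_letter="z")
for row in rows:
    print(" | ".join(row))
```

Calling `len(words_matching(words, 3))` on a full list gives the per-length counts, so the same helper covers both comparisons.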
The focus of 3of6game is to list more common words, and it is clear that it succeeds with this. For example, the Lexica-US word "zel" is defined by Wiktionary as "Alternative form of zill", but "zill" is not even included in Lexica-US. For anyone interested, Wiktionary defines "zill" as "One of a set of small finger cymbals used in belly dancing and similar performances."
"zed" and "zen" are the 2 that I would miss from 3of6game, but I would be happy to remove zea, zee, zel, zoa and zzz.
zig might have a place, but only if zag is in the list as well, which it is not right now. Having just one of them makes no sense, from what I can read up on.
Looking at 3-length words starting with x: 3of6game has none. Lexica-US has one, "xis", which is quite questionable to include. Lexica-GB has 10:
xci
xii
xis
xiv
xix
xor
xvi
xxi
xxv
xxx
Most of these are Roman numerals; I recommend purging all of them.
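Purging could be partly automated. Below is a minimal Python sketch that flags strings that are well-formed Roman numerals. The pattern and the `keep` allowlist are my own suggestion: a purely mechanical filter would also throw out real words such as "mix" (1009) or "xi" (11), so some kind of allowlist is needed.

```python
import re

# Well-formed lowercase Roman numerals from 1 to 3999; the lookahead rejects "".
ROMAN = re.compile(r"^(?=.)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")


def split_roman_numerals(words, keep=frozenset()):
    """Split words into (kept, dropped); words listed in `keep` are never dropped."""
    kept, dropped = [], []
    for w in words:
        if w not in keep and ROMAN.match(w.lower()):
            dropped.append(w)
        else:
            kept.append(w)
    return kept, dropped


gb_x_words = ["xci", "xii", "xis", "xiv", "xix", "xor", "xvi", "xxi", "xxv", "xxx"]
kept, dropped = split_roman_numerals(gb_x_words)
print(kept)  # ['xis', 'xor']
```

This leaves "xis" and "xor" for a manual decision, which matches the feeling that those two are a different kind of questionable.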
My main conclusion is that we first need a definition of which words should be included, so that we have the same goals for all languages when finding or making suitable lists. Without this, it is hard to discuss what words should be included. The definition should be listed on https://github.com/lexica/lexica/blob/master/assets/dictionaries/README.md
Reading up on the 12dicts set, the recommendation if you want to get a bigger list for word games is to combine 3of6game with 2of12inf. Doing this adds "zed" and "xis" to the above list, which would fit with the best ones of the current sets.
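A rough Python sketch of such a merge. As far as I understand, some 12dicts files annotate entries with trailing marker characters, so this strips any trailing non-letter characters and skips entries that still contain non-letters; both choices are assumptions to check against the 12dicts readme. The sample lines, including the "%" marker on "xis", are made up for illustration.

```python
def clean_entries(lines):
    """Lowercase each entry, strip trailing non-letter markers, keep letters-only words."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        # Assumption: annotation markers only appear at the end of an entry.
        while word and not word[-1].isalpha():
            word = word[:-1]
        if word.isalpha():
            words.add(word)
    return words


def merge_lists(*sources):
    """Sorted union of all cleaned sources."""
    merged = set()
    for lines in sources:
        merged |= clean_entries(lines)
    return sorted(merged)


# Illustrative sample lines, not the real file contents.
game_lines = ["zap", "zip", "zit", "zoo"]
inf_lines = ["zed", "xis%", "zap", ""]
merged = merge_lists(game_lines, inf_lines)
print(merged)  # ['xis', 'zap', 'zed', 'zip', 'zit', 'zoo']
```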