
How many words should be in the word lists?

Open RustanHakansson opened this issue 5 years ago • 8 comments

Trying to find a suitable word list for Swedish made me realize how difficult it is to select one, and how important the choice is for the game to be fun. There are all sorts of lists, from small ones of 200 or 1000 common words to lists of 50k or 500k. There are also automated frequency lists built from subtitles, such as the OpenSubtitle FrequencyWords project, which covers a great many languages. It would be awesome if these could just be imported and used as a base for supporting tons of languages, but the lists contain a lot of weird stuff.

What is a reasonable limit? 500k is likely too many. SAOL, the definitive Swedish authority on spelling, has about 125 000 words, but its licensing terms are unclear. Currently the GB dict has 50 000 words and the US one 70 000, while Polish has 428 000. Keeping the number of words reasonably similar is likely a good idea, as it provides a similar play experience across languages, with similar needs for options and defaults.

There are so many word games that it should be possible to reuse something that already exists, in many languages. But after searching for a while I cannot find anything. What are the commercial word games using?

RustanHakansson avatar Sep 03 '20 07:09 RustanHakansson

This seems like a quite good list of 1000 most common words in English: https://simple.wiktionary.org/wiki/Wiktionary:Most_frequent_1000_words_in_English

There are many other similar lists on Wiktionary, both for English and many other languages, some based on frequency lists from subtitles. For a small set this might be good, as it seems to avoid the weird stuff that is more infrequent.

Aspell lists are pretty old for some languages. For Swedish the last update is from 2004, so not ideal.

For reference, here is an overview of dictionaries: https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words This is not the same as word list length though, as word lists need more forms.

RustanHakansson avatar Sep 03 '20 08:09 RustanHakansson

#106

RustanHakansson avatar Sep 03 '20 08:09 RustanHakansson

Maybe it is wrong to think about word lists only in terms of strict length. Grouping words in separate lists, and allowing them to be used or not separately, could open up a lot of interesting new possibilities.

Perhaps maintaining a set of word lists would be really useful as a separate repo, that can be re-used for many word games and more people could help to build and maintain it. Letter frequency etc per language would likely be useful for many types of games.
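As a rough illustration of the letter-frequency idea, here is a minimal Python sketch. It assumes a word list is already available as an iterable of strings; the function name `letter_frequencies` and the sample words are made up for this example.

```python
from collections import Counter

def letter_frequencies(words):
    """Return each letter's share of all letters across the given words."""
    counts = Counter(ch for word in words for ch in word.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical three-word sample, standing in for a full per-language list.
sample = ["zap", "zip", "zoo"]
freqs = letter_frequencies(sample)
```

Because `str.isalpha` accepts non-ASCII letters, the same function should also cope with letters such as å, ä and ö in a Swedish list.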

Here are some categories that would be interesting to have:

- Beginner words (200)
- Common words (1000)
- Expanded common words (10 000)
- City, country, continent, area names
- Flora names
- Fauna names
- Celebrity names
- Hard/unusual words
- Abbreviations
- Massive list of everything
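One way to picture separately switchable lists is a mapping from category name to a set of words, where the playable dictionary is the union of whichever categories are enabled. This is only a sketch: the category keys, the sample words and the `build_dictionary` helper are invented for illustration and are not part of Lexica.

```python
def build_dictionary(word_lists, enabled):
    """Union the words of every enabled category into one playable set."""
    words = set()
    for category in enabled:
        words |= word_lists[category]
    return words

# Hypothetical categories with a few placeholder words each.
word_lists = {
    "common": {"zap", "zip", "zoo"},
    "fauna": {"zebra", "zebu"},
    "abbreviations": {"etc"},
}

# A player who wants common words and animal names, but no abbreviations.
active = build_dictionary(word_lists, enabled=["common", "fauna"])
```

Keeping the categories as separate sets means the same data could back both a "beginner" mode and a "massive list of everything" mode without duplicating files.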

RustanHakansson avatar Sep 03 '20 08:09 RustanHakansson

This seems like the best existing collection of dictionaries, as identified in other issues: https://github.com/tube42/wordlists

Discussion there: https://github.com/tube42/wordlists/issues/2

RustanHakansson avatar Sep 05 '20 13:09 RustanHakansson

The most comprehensive word list project seems to be http://wordlist.aspell.net/ , with the 3of6game list being the one adapted for situations like Lexica: http://wordlist.aspell.net/12dicts-readme/#3of6game from #58 . It is only for English, but the choices made there might help us with deciding what would be suitable goals here for other languages as well.

The 3of6game list contains 65k words, in a format well suited for use here. It includes variations on words, for example:

abort aborted aborting abortion abortionist abortionists abortions abortive aborts

For most common users, this is likely the

The source code for these wordlists is at https://github.com/en-wl/wordlist

The word samples on http://wordlist.aspell.net/scowl-readme/ might help with discussion as well. Level 90 includes really archaic words (total 210k words), but for full scrabble mode they should be included:

adlumidine alinasal aramina begreen bembixes boundly cannibalean

For most users, a level around 60 is probably suitable, so that the words are at least somewhat regularly used:

absurdists botanic cascaras charterer chestier commonsense equitation

The categorizations used there are:

Except for the special word lists the files follow the following
naming convention:
  <spelling category>-<sub-category>.<size>
Where the spelling category is one of
  english, american, british, british_z, canadian, australian
  variant_1, variant_2, variant_3,
  british_variant_1, british_variant_2, 
  canadian_variant_1, canadian_variant_2,
  australian_variant_1, australian_variant_2
Sub-category is one of
  abbreviations, contractions, proper-names, upper, words
And size is one of
  10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large), 
  80 (huge), 95 (insane)
The special word lists follow are in the following format:
  special-<description>.<size>
Where description is one of:
  roman-numerals, hacker
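
The naming convention quoted above is regular enough to parse mechanically, which would help when picking files up to a given size. The following Python sketch assumes file names exactly as described in the quote; the regular expression and the helper names `parse_scowl_name` and `up_to_size` are my own and are not taken from the SCOWL tooling.

```python
import re

# <spelling category>-<sub-category>.<size>
# The spelling category may contain underscores and digits but no hyphen,
# so the first hyphen separates it from the sub-category.
SCOWL_NAME = re.compile(r"(?P<spelling>[a-z_0-9]+)-(?P<sub>[a-z-]+)\.(?P<size>\d+)")

def parse_scowl_name(filename):
    """Split a file name into (spelling category, sub-category, size)."""
    match = SCOWL_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised word list name: {filename!r}")
    return match.group("spelling"), match.group("sub"), int(match.group("size"))

def up_to_size(filenames, max_size):
    """Keep only the files whose size level is at most max_size."""
    return [name for name in filenames if parse_scowl_name(name)[2] <= max_size]

names = ["english-words.60", "english-words.80", "british_variant_1-proper-names.35"]
selected = up_to_size(names, 60)
```

The special lists follow the same shape, so `parse_scowl_name("special-roman-numerals.35")` yields `special` as the category and `roman-numerals` as the sub-category.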

RustanHakansson avatar Sep 05 '20 14:09 RustanHakansson

There seems to be huge variation internationally in number of words accepted for scrabble, according to a post here: https://boardgames.stackexchange.com/questions/25243/how-many-allowed-scrabble-words-are-there-in-different-languages

- Dutch: 652k
- English: 279k
- French: 393k
- German: 180k

RustanHakansson avatar Sep 06 '20 10:09 RustanHakansson

I did some comparisons between the current en_GB and en_US lists used in the game, and the 3of6game list. For example, for words of length 3 there is a noticeable difference.

- en_US: 930
- en_GB: 821
- 3of6game: 639
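
Counts like these can be reproduced with a few lines of Python. The sketch below assumes each dictionary is a plain text file with one word per line; the `load_words` and `length_histogram` helpers and the file name in the comment are hypothetical, and the demo uses a tiny in-memory sample rather than the real lists.

```python
from collections import Counter
from pathlib import Path

def load_words(path):
    """Read one word per line, lower-cased, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def length_histogram(words):
    """Map word length to the number of words of that length."""
    return Counter(len(word) for word in words)

# In-memory stand-in; with real data this would be e.g. load_words("en_US.txt"),
# where the file name is an assumption.
sample = {"zap", "zip", "zoo", "zeal", "zebra"}
histogram = length_histogram(sample)
three_letter_count = histogram[3]
```

Running `length_histogram` on each list and comparing the results length by length gives a quick overview of where the lists differ most.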

Comparing the words starting with z, of length 3:

| Lexica-US | Lexica-GB | 3of6game |
|-----------|-----------|----------|
| zap | zap | zap |
| zea |     |     |
| zed | zed |     |
|     | zen |     |
| zee |     |     |
| zel |     |     |
| zig |     |     |
| zip | zip | zip |
| zit | zit | zit |
| zoa |     |     |
| zoo | zoo | zoo |
| zzz |     |     |
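
The same comparison can be expressed with set operations. This Python sketch uses the three-letter z-words from the table as literal sets; the variable names are my own, and for full lists the sets would first be filtered by prefix and length.

```python
# Three-letter words starting with "z", copied from the table above.
lexica_us = {"zap", "zea", "zed", "zee", "zel", "zig", "zip", "zit", "zoa", "zoo", "zzz"}
lexica_gb = {"zap", "zed", "zen", "zip", "zit", "zoo"}
game_3of6 = {"zap", "zip", "zit", "zoo"}

# Words every list agrees on.
common_to_all = lexica_us & lexica_gb & game_3of6

# Words that would disappear if only 3of6game were used.
missing_from_3of6 = (lexica_us | lexica_gb) - game_3of6

# Words that only the current US list accepts.
only_in_us = lexica_us - lexica_gb - game_3of6
```

Here `missing_from_3of6` is the set of words to review by hand, since it mixes candidates worth keeping with obscure entries.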

The focus of 3of6game is to list more common words, and it is clear that it succeeds with this. For example, Lexica-US includes the word "zel", which Wiktionary defines as "Alternative form of zill", but "zill" itself is not even included in Lexica-US. For anyone interested, Wiktionary defines "zill" as "One of a set of small finger cymbals used in belly dancing and similar performances."

"zed" and "zen" are the 2 that I would miss from 3of6game, but I would be happy to remove zea, zee, zel, zoa and zzz.

zig might have a place, but only if zag is in the list as well, which it is not right now. Having just one of them makes no sense, from what I can read up on.

Looking at three-letter words starting with x: 3of6game has none. Lexica-US has one, "xis", which is quite questionable to include. Lexica-GB has 10:

xci
xii
xis
xiv
xix
xor
xvi
xxi
xxv
xxx

These are mostly Roman numerals; I recommend purging all of them.
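
Finding such entries could be partly automated. The sketch below flags words that parse as well-formed Roman numerals using a commonly used regular expression; it is an illustration, not something Lexica does today. Note that some real words, such as "mix", also parse as numerals, so the result should be treated as candidates for review. The `keep` allow-list is a hypothetical parameter for exactly that case.

```python
import re

# Well-formed lowercase Roman numeral (1 to 3999 and beyond via repeated "m").
# The lookahead rejects the empty string, which the rest would otherwise match.
ROMAN = re.compile(
    r"(?=[mdclxvi])m*(c[md]|d?c{0,3})(x[cl]|l?x{0,3})(i[xv]|v?i{0,3})"
)

def roman_numeral_candidates(words, keep=()):
    """Return the words that look like Roman numerals, minus an allow-list."""
    keep = set(keep)
    return sorted(w for w in words if ROMAN.fullmatch(w) and w not in keep)

# The ten three-letter x-words from Lexica-GB listed above.
gb_x_words = ["xci", "xii", "xis", "xiv", "xix", "xor", "xvi", "xxi", "xxv", "xxx"]
flagged = roman_numeral_candidates(gb_x_words)
remaining = [w for w in gb_x_words if w not in flagged]
```

With this input, "xis" and "xor" are the only two entries that are not flagged, so they would still need a separate decision.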

Mainly my conclusion is that we need a definition of what words should be included, as a start, so we can have the same goals for all languages to find or make suitable lists. Without this, it is hard to discuss what words should be included. The definition should be listed on https://github.com/lexica/lexica/blob/master/assets/dictionaries/README.md

RustanHakansson avatar Sep 14 '20 09:09 RustanHakansson

Reading up on the 12dicts set, the recommendation if you want to get a bigger list for word games is to combine 3of6game with 2of12inf. Doing this adds "zed" and "xis" to the above list, which would fit with the best ones of the current sets.
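
Combining two lists amounts to a de-duplicated union. The Python sketch below assumes both lists are already plain words, one per entry; the real 12dicts files may carry annotation characters, which this sketch does not handle. The `merge_word_lists` helper is hypothetical, and the second sample list is an illustrative stand-in, not actual 2of12inf contents.

```python
def merge_word_lists(*lists):
    """Return the sorted, de-duplicated union of several word lists."""
    merged = set()
    for words in lists:
        merged.update(w.strip().lower() for w in words if w.strip())
    return sorted(merged)

# Three-letter z-words that 3of6game has, per the comparison above.
from_3of6game = ["zap", "zip", "zit", "zoo"]
# Illustrative stand-in for the extra words the second list contributes.
from_second_list = ["zed", "xis", "zap"]

merged = merge_word_lists(from_3of6game, from_second_list)
```

Duplicates such as "zap" collapse automatically, so the order in which the lists are passed does not matter.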

RustanHakansson avatar Sep 14 '20 10:09 RustanHakansson