New Russian dictionary built from ru.wiktionary.org
Hi, thanks for Lexica.
However, the Russian dictionaries in the game are rather poor:
- The Russian dictionary is rather small (only ~34k words) and misses many ordinary words.
- Russian (Extended) is larger (~493k words) but includes inflected case forms. (Word games in Russia usually accept only the nominative case of nouns.)
I would propose another dictionary of ~114k words (see the attachment). I built it from Russian Wiktionary (the site Lexica uses to get word definitions) by listing the "Русские_существительные" (Russian nouns) category and removing:
- compound words (with a hyphen, e.g. яхт-клуб);
- proper names (with capital letters, e.g. Иван).
(Actually, I kept only words consisting entirely of lowercase letters.) Since the dictionary is built from Wiktionary, its license is the Creative Commons Attribution-ShareAlike 3.0 Unported License.
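The filtering rule above (keep only words made entirely of lowercase letters) can be sketched roughly as follows. This is a minimal Python illustration, not the actual build script; the names `keep_word` and `filter_titles` are hypothetical, and it assumes "lowercase letters" means the lowercase Russian alphabet including ё.

```python
import re

# Assumption: a word is kept only if every character is a lowercase
# Russian letter (а-я plus ё, which lies outside the а-я range).
# This drops hyphenated compounds and capitalised names in one pass.
LOWERCASE_RUSSIAN = re.compile(r"[а-яё]+")


def keep_word(word: str) -> bool:
    """Return True if the word consists only of lowercase Russian letters."""
    return LOWERCASE_RUSSIAN.fullmatch(word) is not None


def filter_titles(titles):
    """Keep only the page titles that pass keep_word, preserving order."""
    return [title for title in titles if keep_word(title)]


sample = ["яхт-клуб", "Иван", "небо", "нёбо"]
print(filter_titles(sample))  # ['небо', 'нёбо']
```

A single whitelist check is simpler than separate tests for hyphens and capitals, and it also rejects anything containing digits, spaces, or Latin letters.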
BTW, in this dictionary Е and Ё are different letters and are not interchangeable. They should have different costs, since Ё is used much more rarely than Е: 3k occurrences for Ё versus more than 87k occurrences for Е.
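Occurrence counts like the ones quoted above can be obtained by tallying letters over the word list. The sketch below is only an illustration on a tiny made-up sample; `letter_counts` is a hypothetical helper, and how Lexica turns such counts into letter points is not shown here.

```python
from collections import Counter


def letter_counts(words):
    """Count how many times each letter occurs across all words."""
    counts = Counter()
    for word in words:
        counts.update(word)  # updating with a string adds each character
    return counts


# Tiny illustrative sample, not the real dictionary.
sample = ["небо", "нёбо", "падеж", "падёж", "лес"]
counts = letter_counts(sample)
print(counts["е"], counts["ё"])  # 3 2
```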
Thank you very much for your time and effort. I did find some non-existent words in the dictionary you had compiled, as well as the words that are numerals, pronouns, and not nouns. After compiling the list of Russian nouns myself and going through yours, I ended up with the dictionary found at https://github.com/Harrix/Russian-Nouns under the MIT license. It has both the word list and the definitions.
I removed compound words from it, along with some pronouns and the numerals, and generated a letter distribution from that list. It produces very nice boards, and the dictionary has about 57,000 words.
I'm leaving the letter "Ё" for now on account that it is crucial to RSL (Russian as Second Language) players. And keeping the letter "Ё" did not affect the gaming experience so far.
Having said that, out of 2099 nouns with "ё", 59 have an alternative pronunciation and spelling with "е". These alternative spellings are included in the dictionary. The remaining 2040 nouns present no challenge to a native speaker if spelled with "е" rather than "ё". For this very reason the letter "Ё" is now considered optional in modern spelling. There are, however, 7 nouns that depend on the use of the letter "ё": they become totally different words with different meanings if spelled with "е", e.g. небо (the sky) - нёбо (palate), падеж (noun case) - падёж (loss of livestock), etc.
And sometimes, where the use of "ё" is optional, a noun may change to an adverb, like далёко (the future) - далеко (far away).
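Pairs like небо/нёбо can be found mechanically by replacing "ё" with "е" and looking for words that collapse onto the same spelling. This is a rough Python sketch on a made-up sample; `yo_collisions` is a hypothetical helper, and it only detects collisions inside the given list (it would not catch noun/adverb pairs unless both forms are in the list).

```python
from collections import defaultdict


def yo_collisions(words):
    """Group words that become identical once every ё is written as е."""
    groups = defaultdict(set)
    for word in words:
        groups[word.replace("ё", "е")].add(word)
    # Keep only spellings shared by two or more distinct words.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


sample = ["небо", "нёбо", "падеж", "падёж", "ёлка"]
for plain, variants in yo_collisions(sample).items():
    print(plain, variants)
```

Words such as ёлка, which have no counterpart spelled with "е" in the list, are not reported, so the output is limited to the cases where dropping "ё" would merge two entries.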
So as I said, I had decided to keep "Ё" for now even though most people I had asked about it strongly object. I am also keeping derivatives, as we have been used to having those in the dictionary Lexica has shipped with for over a year. I will remove all diminutive suffixes forms in the future for sure. This is a manual process, as there are words that look alike but must be present in the dictionary because they are normal nouns.
The dictionary can be found in my pull request. I'll do my best to keep it up to date based on user feedback.
Hi, @OpenWarp, I like your work. Expanding the wordlist 1.5x is great!
So as I said, I had decided to keep "Ё" for now even though most people I had asked about it strongly object.
Yep ))) To make the right decision you need to clearly understand who your users are. RFL users don't need and don't like ё-s in Lexica-like apps. RSL users may like them as a way to improve their Russian. Actually, Lexica allows two (or 3, 4, ...) dictionaries per language, as we have right now. Maybe that would satisfy both categories of users?
Also, Harrix' repo was discussed before: https://github.com/lexica/lexica/issues/88#issuecomment-627005816
Hi @ildar, thank you very much for pointing to the previous discussion. Once I finish compiling the word list manually I will release both versions under a Creative Commons license. Since words of a natural language are the public domain and no restrictions may be imposed on the use of the words, the list compiled from multiple sources (books, dictionaries, one's memory etc.) constitutes the original work, and that is my contribution to the project. The dictionary that provides the definitions of the words is another matter. That may be a copyrighted work, and you'll have to contact the copyright holder to clarify the terms or to get explicit permission if you want to integrate it into the project. Referencing a publicly available web page and displaying its contents within the app, on the other hand, does not violate anything.
The way I see it, there should be a word list of commonly used nouns ("ё" and "е" versions), and a word list to replace the current "extended" list, which will include nationalities, terms, etc. present in most dictionaries available on the Internet. I think that cross-referencing the word list with books such as classical literature is a good way to determine which words should go into which list. I can think of several nouns off the top of my head that are found in classical literature but whose definitions can be found nowhere on the Internet. These will go into the extended list, if anywhere. And I can assure you that the word lists will be an original work. It just takes time.
Plz note that all the "legal" decisions are made by @pserwylo
@OpenWarp:
I did find some non-existent words in the dictionary you had compiled, as well as the words that are numerals, pronouns, and not nouns.
Could you publish the list of these words? I got the list from the "Nouns" category of Russian Wiktionary. If you found non-noun words there, Wiktionary should be updated.
I'm leaving the letter "Ё" for now on account that it is crucial to RSL (Russian as Second Language) players. And keeping the letter "Ё" did not affect the gaming experience so far.
I agree that Е and Ё are different letters. Not only for RSL players, but for native speakers also. The reason (one of the reasons) is letter frequency: Е is very frequent (so it costs 1 letter point) while Ё is not (3 letter points).
So as I said, I had decided to keep "Ё" for now even though most people I had asked about it strongly object.
That's the right decision.
I will remove all diminutive suffixes forms in the future for sure.
I am not sure about this decision. Does an English dictionary have both "pig" and "piglet"? "App" and "applet"?
@ildar:
RFL users don't need and don't like ё-s in Lexica-like apps.
That's not true. Many users don't like Ё, but many do. The first category of users is probably larger than the second, but you can't speak for all RFL users.
BTW, look at the Lexica's prototype:

Hi all, absolutely loving the discussion here, and apologies I've been a bit absent. More than happy to continue tweaking the Russian dictionary/dictionaries - as you've guessed I have zero knowledge of the language so will delegate all decisions on languages to the community - hence me being extra appreciative of this great discussion.
Legal thoughts
- Absolutely happy with wiktionary based creative-commons licensed dictionaries being used in Lexica.
- No need to discuss licenses for definitions - it is just a little too hard to bundle definitions into Lexica at this point, so we will not consider adding them into the app until there is a broader move to do so.
Proposed additions/changes
As discussed, I'd be happy to merge whatever dictionaries we all think are best, but right now I'm at a bit of a loss because there are a few floating around. Let's quickly summarise in the hope it will help move forward and incorporate some of these contributions.
Current dictionaries
Russian + Russian (Extended) - Originally from ASpell, then replaced with this dictionary after discussion on #88 by @ildar and @HenriDellal, culminating in PR #161 by @ildar.
Dictionary from @van-de-bugger attached in this issue - Would you anticipate this being another, additional dictionary? Happy to have as many as necessary for any given language to be enjoyed by all. If so, what should it be called? e.g. "Russian (Nouns)"? If not, which would it replace, and are @ildar and others involved in creating those happy to do so?
PR from @OpenWarp (#321) - This is to replace the main "Russian" dictionary (leaving "Russian (Extended)"). I'm happy to merge this as is if others such as @ildar who were involved in the original are happy to do so. If not, can we update that PR to instead be called something different? e.g. "Russian (Nouns)".
Not trying to make this a competition or anything, we all want the most fun dictionaries in Lexica - but there are now two "Russian (Nouns)" dictionaries currently proposed to be added to Lexica: one from Wiktionary and one from https://github.com/Harrix/Russian-Nouns. Do we add both? Or one? If one, which one?
Thanks!
On Fri, Feb 4, 2022 at 5:12 AM van-de-bugger wrote:
@OpenWarp:
I will remove all diminutive suffixes forms in the future for sure.
I am not sure about this decision. Does an English dictionary have both "pig" and "piglet"? "App" and "applet"?
I wouldn't either. Diminutive suffixes form perfectly fair words that come to a player's mind.
@ildar:
RFL users don't need and don't like ё-s in Lexica-like apps.
That's not true. Many users don't like Ё, but many do. The first category of users is probably larger than the second, but you can't speak for all RFL users.
Note the context. I write: in Lexica-like apps. For example, in my letters and such I use ё extensively, even though I may be thought of as archaic. By contrast, in Lexica-like apps I'd want to use е only, avoiding any thought of whether it's right to use е or ё in a given word.
Now that I am going through the word list it becomes evident that Lexica is very different from crossword puzzles. And while the dictionaries we have now are perfectly fine for crossword puzzles they are largely inadequate for Lexica. The reason for this is that in the crossword puzzle you have questions and hints, which make it easy to guess the correct form or the common name.
E.g. Second letter in Greek alphabet; the German Parliament; one of the Muses; indigenous people of (some remote area); mare at some places in Ural region; VTUZ graduate; affectionate name for ... etc.
There are no such hints in Lexica making it impossible to guess antiquated words, names, regional dialects, vernaculars, etc. And yes, it is irritating to enter all derivatives of the main word or to guess the derivative form of an uncommon word when the main word is not present on the board.
So I think that numerals, pronouns, vernaculars, vulgarisms, names of places, historical figures, names of letters in alphabets, nationalities, antiquated words, abbreviations, and derivative forms are to be removed from the Lexica dictionary.
As to diminutive suffixes, I am leaving them where it is common to use these forms. And sometimes the form with diminutive suffixes can have other meanings. These words stay in the dictionary, there is no need to worry.
Compound words without hyphens are a bit tricky. After giving it some thought, I believe that only common words need to be in the dictionary, while all possible permutations are too much and probably won't be possible on most boards anyway.
The most controversial part to me is the special terms, those known only to a trade or profession. There is no chance of those being guessed by anyone outside the trade, and there is absolutely no need for the general public to learn them. It is a bump. However, most people do belong to a certain trade, and they will be surprised if not frustrated by the absence of the terms familiar to them. Rare diseases, rare and extinct species, and some very specific tools in the mining industry are just a few examples.
At the end of the day, we only need one dictionary containing nouns, or maybe two if we need "Е" and "Ё" versions. Harrix is a good replacement for the current Russian dictionary until the manual version is compiled. And we can safely remove the Russian (extended) dictionary. The version compiled by @van-de-bugger contains non-existent words as well. Nevertheless, the effort is very much appreciated, as we need the dictionary badly.
@pserwylo:
The dictionary proposed by me is built from the Wiktionary, so "ru.wiktionary.org" (or something similar: "ru.wiktionary", "Wiktionary (Ru)", etc) would be the most straightforward name.
Just a few reasons to advocate the proposed dictionary:
- Lexica uses Wiktionary as the source for word definitions. Using Wiktionary as the source for words as well just makes Lexica consistent: it is quite frustrating to see the Wiktionary definition of a word that Lexica rejects.
- The dictionary is built automatically by a simple script (~50 lines of code) I wrote in Raku.
- The dictionary can be rebuilt at any time (e.g. as a step in building a Lexica release) to pick up changes in Wiktionary. It takes ~3 min to download the dictionary from ru.wiktionary.org. (If you are interested in the script, just let me know, I'll share it.)
- Wiktionary is a live project. The community of Wiktionary editors works on it, so new words will appear as time goes by. Errors, if any are found, can easily be fixed by editing Wiktionary and will then get into Lexica automatically.
- It is the largest of the existing Russian dictionaries in Lexica. I believe there is no reason to exclude professional terms, obsolete or profane words, etc. If you are smart enough to find a word in the Lexica square, your knowledge of rare words and ability to locate them should be rewarded, not ignored.
Regarding the other Russian dictionaries: as long as "ru.wiktionary.org" is available in Lexica, I do not mind whether other dictionaries are present or absent. If someone else finds another dictionary useful, that's ok, I don't object. Extra dictionaries do not hurt me.
Having more than one dictionary can confuse newbies, though. If there is more than one dictionary for a language, every dictionary should be clearly annotated to show how it differs from the others. For example, the "Russian (extended)" dictionary annotation is "Includes grammar cases and abbreviations". The annotation for "ru.wiktionary.org" could be "114000+ nouns from ru.wiktionary.org, nominative case only, Ё ≠ Е".
BTW, why not make the dictionaries optional and install them separately as plug-ins or add-ons? That would be an ideal solution for everybody.
@OpenWarp:
I asked you for the list of errors in the dictionary. You did not respond but complained about non-existent words again. Could you be more specific and provide the list of errors you noticed in the dictionary? Or, at least, a few examples of non-existent words? It will help to fix either Wiktionary or the build script.
@ildar:
Note the context. I write: in Lexica-like apps.
You can't speak for all users of Lexica-like apps anyway.
@van-de-bugger Sorry I missed your request. Whatever the community goes with is fine with me. I would simply be distributing the proper dictionaries among the users who specifically asked me to compile the dictionary by hand. And if you can automate the compilation script to meet the requirements I had outlined with specific examples, then by all means. You need a capability to filter out:
- Garbage (а, аа, аак, аал, ба, вэ, д, да, даальдер, даб, даба, дейк, г, ге, гёрлс, ги, гиммик, гэл, гэп, собина, собинка, сим, сима, силуэтность).
- Pronouns (он, она, тут, тот etc.).
- Abbreviations (ВТУЗ, гидрометобсерватория, осфинконтроль, госязык, гэкачепист, гэпэушник, дезкамера, собес etc.).
- Numerals (сто).
- Names (аполлон, бундес*, гитлер etc.).
- Nationalities no one ever heard about (вэйци, гаэл, гаял).
- Compound words but not all of them (взаимоблокировка but not взаимовыручка, видеоинструментарий, гексакосиойгексеконтагексафобия, гелиооборудование, информ*, полулюбительство, самопрограммирование, смолообразование, собаковладелец, etc.).
- Vernaculars (бу, вышак, давалка, дезуха, залипашка, залипуха, крутовать, автобусник, брючата, брюхан, брюхач, бряк, сморкач, сморкун etc.).
- Vulgarisms (говноед).
- Terms (вюаньятит, гидрометла, гипокаталаземия, дезоксирибонуклеопротеид, кольпоперинеолеваторопластика, симбиосома, симбиогенез).
- Words like деинтернационализация, деиспанизация, двадцатиоднолетие, десятикилометровка.
- Derivatives (брючонки, брючишки, брючки, гостенёк).
- Letter names (бета, тау, фита etc.).
- Not real nouns (высокопреосвященнейший, вяжущее and the like).
- Verbs (брешешь).
There are too many examples, and if I knew how to clean it up with a script, there would be no need to cherry-pick the words from different sources.
While your point that generating the list from Wiktionary has certain advantages is valid, it has been noted time and time again that the resulting word list is trash, and that it is the main reason people are getting frustrated with Lexica. I would strongly recommend against setting it up as the default.
Dictionary as a plugin is great. The reason we do not have that is that not everyone knows how to generate the letter distribution from the dictionary in order to apply it. This is crucial for the game to generate nice boards.
And the dictionary I am trying to compile is a subset of your generated word list anyway, so the definitions can be found in Wiktionary, no problem there. And the language is slow to adopt new words, so it is easy to update the list manually.
Let the community decide. We do not need so many dictionaries shipped for one language, even if the language is complex. I would vote for quality over quantity.
@van-de-bugger I am sorry I cannot provide a complete list of non-existent words, as browsing through the whole list is too much work and would take me months. Not to mention that I would have to double-check every word I suspect does not exist just to make sure. I did notice, though, that for most such words Wiktionary returns an empty "usage examples" section. And it is worth mentioning that Wiktionary is well maintained in the sense that if there is no meaning for the word, then the word does not exist. Another approach: if you exclude the Harrix word list from your own list, the job of finding odd words will be easier.
I never complained about your word list, this is a misunderstanding. What I was saying was that it was easier to compile the dictionary from different reliable sources like classical literature rather than cleaning up ~150,000 word list. And that is what I am busy with.
And I made my point crystal clear: we have the Harrix list, which is perfect for crossword puzzles. Lexica is a game with no hints, and as such requires a simplified dictionary following strict rules. These rules are being stretched for the commonly used forms that break them (think spelling: just 1.5% of English words do not follow the spelling rules; however, 400 of them are among the most commonly used words, 400 out of 2000). And that is the reason I have to compile the list by hand. Until then, Harrix is much better than any of the word lists Lexica now comes with. It will replace the current dictionary, and the extended word list must be removed.
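The suggestion above, excluding the Harrix word list from the generated list so that only the "odd" words remain for review, amounts to a set difference. Here is a minimal Python sketch; `load_words` and `only_in_first` are hypothetical helpers, and both the file layout (one word per line, UTF-8) and the file names in the usage comment are assumptions.

```python
def load_words(path):
    """Read a one-word-per-line UTF-8 file into a set (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def only_in_first(generated, reference):
    """Return the words present in `generated` but absent from `reference`, sorted."""
    return sorted(set(generated) - set(reference))


# In-memory example; the real lists are far larger.
generated = ["аак", "небо", "даб", "падеж"]
reference = ["небо", "падеж"]
print(only_in_first(generated, reference))  # ['аак', 'даб']

# With files (hypothetical names), the call would look like:
# only_in_first(load_words("wiktionary.txt"), load_words("harrix.txt"))
```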
@OpenWarp:
That's not garbage:
- "а" is a noun, the name of the first letter of the Russian alphabet.
- "аа" is a noun too.
- "аак" is a boat type, you can check it at ru.wiktionary.org.
- "аал" is a type of populated place.
- "ба" is a synonym for "grandma".
- "даальдер" is the name of a Dutch coin.
- "даб" is a musical style.
- ...
Pronouns:
- "он": it is a pronoun. But it is a noun, too: this is the name of the letter "o" in the Church Slavonic alphabet.
- "она": there is no such word in the dictionary.
- "тут": it is a pronoun. But it is a noun, too: this is a synonym for the mulberry (Morus) tree and its wood.
- "тот": there is no such word in the dictionary.
Abbreviations:
- "втуз": some dictionaries spell it as "ВТУЗ" and consider it an abbreviation. But some dictionaries spell it as "втуз" and consider it a neologism which has all grammatical cases (втуза, втузу, втузе, ...) and a derived adjective (втузовский).
- "гидрометобсерватория": I do not see any problem with this word. If someone is able to find it in the Lexica square, they must be a damn smart guy.
Names:
- "аполлон": this is the name of the species Parnassius apollo, a butterfly, not the proper name of the Greek god.
- "бундес*": they are not proper names, what is the problem with them?
- "гитлер" is a slang name for a 0.75 l bottle.
Nationalities no one ever heard about (вэйци, гаэл, гаял)...
Take it easy. If someone knows these nationalities, let's simply reward that person's knowledge, not ignore it.
Compound words but not all of them (взаимоблокировка but not взаимовыручка...
I see both "взаимоблокировка" and "взаимовыручка" in the dictionary, as well as all other words you mentioned. And I do not see any problems with these words.
I do not see any problems with vernaculars, vulgarisms (жопа есть, а слова нет? That is, "the ass exists, but the word for it doesn't?"), terms, derivatives, letter names and words like "деинтернационализация".
Not real nouns:
- "высокопреосвященнейший": you may like it or not, but it is a noun: it is the name of a church title.
From all your examples I found only one bug in Wiktionary:
- "сто"
and two questionable words:
- "вяжущее"
- "брешешь"
In both cases Wiktionary contains an explanation of why these words are considered nouns.
Also, I forgot to filter out words that are too short (made of 1 or 2 letters; it seems 3-letter words are the shortest allowed in Lexica), but that does not hurt Lexica and is very easy to fix.
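The missing length filter mentioned above is a one-line check. A small Python sketch follows; `drop_short_words` is a hypothetical helper, and the minimum of 3 letters is taken from the remark above rather than verified against Lexica's source.

```python
# Assumption from the comment above: Lexica's shortest accepted word has 3 letters.
MIN_WORD_LENGTH = 3


def drop_short_words(words, min_length=MIN_WORD_LENGTH):
    """Keep only the words with at least `min_length` letters."""
    return [word for word in words if len(word) >= min_length]


print(drop_short_words(["а", "аа", "аак", "небо"]))  # ['аак', 'небо']
```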
Thus, I do not see "too many examples" of "non-existent words". Most of the examples are just a matter of your personal taste.
Lexica trusts Wiktionary as a provider of the word definitions, but does not trust it as a provider of the dictionary? That is silly.
it was easier to compile the dictionary from different reliable sources like classical literature rather than cleaning up ~150,000 word list.
It seems you are going to reinvent the wheel, and you drastically underestimate the amount of work. I guess you are not aware of all the problems and issues. Just answer the question: do you know what НКРЯ is?
Wiktionary community is made up of interested people and has been working on the dictionary for many years. It is simply unwise to reject their result, which is available for free.
@OpenWarp
Since words of a natural language are the public domain and no restrictions may be imposed on the use of the words, the list compiled from multiple sources (books, dictionaries, one's memory etc.) constitutes the original work
No, a list of words is a database, and any database (at least in Russia and the European Union) is covered by database rights. This is a separate intellectual property right, unrelated to copyright. See Wikipedia here. I just posted an issue at the Harrix repo here for the same reason.