Suggestions for Italian dictionary

Open airon90 opened this issue 5 years ago • 4 comments

Hi, I am playing Lexica in Italian. Non-Italian letters and accented Italian letters often appear on the board, which makes the game harder to play because very few Italian words contain them. Often only one word contains a given letter. Moreover, accents appear only at the end of some words.

So I suggest removing the English-only letters (j, k, w, x, y), the accented letters (à, è, ì, ò, ù) and some uncommon letters (h, q) from the set of letters that can appear on the board, to make the game better. Accented letters could be converted to their base letters (a, e, i, o, u).

airon90 • Sep 25 '19 09:09

From a comment on Google Play (seemed relevant to this):

Very fun and interesting, but the Italian has some issues, it reports a lot of words that don't exist and seems to skip a few existing ones. Also the accents aren't used enough in Italian to make sense in the game, they just become dead cells. Great design for the multiplayer!

I've done a quick count of how many times each letter is represented in the Italian dictionary, and came up with this:

for CHAR in $(./show-chars-in-dict.sh it); do COUNT=$(grep $CHAR assets/dictionaries/dictionary.it.txt | wc -l) && echo "$CHAR: $COUNT"; done
a: 192407
b: 29064
c: 77695
d: 46509
e: 140688
é: 552
è: 22
f: 26043
g: 41157
h: 6887
i: 183650
ì: 623
j: 40
k: 297
l: 100429
m: 60781
n: 92088
o: 121930
ò: 7971
p: 45093
q: 1391
r: 132735
s: 96981
t: 120556
u: 46735
ù: 42
v: 45903
w: 76
x: 2028 (But only 437 if I exclude roman numerals using the regex `^[lxivcdm]+$`)
y: 128
z: 12217
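
For anyone who wants to reproduce these numbers without the shell loop, here is a rough Python sketch of the same count. It assumes the dictionary is available as one lowercase word per line; `count_words_per_letter` is a hypothetical helper name, not something in the Lexica repository. Note that the roman numeral regex matches any word made up only of those seven letters, so it will also skip ordinary short words such as "di" or "mi".

```python
import re
from collections import Counter

# Same pattern as quoted above for excluding roman numerals.
ROMAN_NUMERAL = re.compile(r"^[lxivcdm]+$")


def count_words_per_letter(words, skip_roman_numerals=False):
    """Count how many words contain each character at least once.

    This mirrors `grep CHAR file | wc -l`, which counts matching lines
    rather than total occurrences of the character.
    """
    counts = Counter()
    for word in words:
        if skip_roman_numerals and ROMAN_NUMERAL.match(word):
            continue
        # set(word) makes each word contribute at most 1 per character.
        counts.update(set(word))
    return counts


if __name__ == "__main__":
    sample = ["caffè", "xiv", "taxi"]
    print(count_words_per_letter(sample)["x"])
    print(count_words_per_letter(sample, skip_roman_numerals=True)["x"])
```

To run it on the real file, read the lines of `assets/dictionaries/dictionary.it.txt`, strip the trailing newlines, and pass the resulting list in.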

Limiting this to just those mentioned by @airon90 above, we see the following:

j: 40
k: 297
w: 76
x: 2028 (But only 437 if I exclude roman numerals using the regex `^[lxivcdm]+$`)
y: 128

à: 3849
é: 552
è: 22
ì: 623
ò: 7971
ù: 42

Again, stressing that I am not an Italian speaker, but given that ò and à appear in so many words, perhaps we should do as recommended above and normalize them to o and a respectively.

pserwylo • Nov 27 '21 23:11

With all that said, here is a proposal with some questions. If I'm able to get some confirmation from speakers of the Italian language, then I'd be happy to action them:

  • Adjust the current dictionary, resulting in only one Italian dictionary (no need for "Italian" + "Italian (extended - including à, etc)").
  • Remove all words containing the letters j, k, w, x, y.
  • Leave the uncommon letters h and q (as with English and other languages, they will be assigned a low probability, and thus appear less frequently in boards anyway).
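
The second bullet could look roughly like the sketch below. It is only an illustration of the filtering step, assuming the dictionary is a list of lowercase words; `drop_words_with_excluded_letters` is a made-up name, and how Lexica actually builds its dictionaries may differ.

```python
# Letters proposed for removal from the Italian dictionary.
EXCLUDED_LETTERS = frozenset("jkwxy")


def drop_words_with_excluded_letters(words, excluded=EXCLUDED_LETTERS):
    """Keep only the words that contain none of the excluded letters."""
    return [word for word in words if not excluded.intersection(word)]


if __name__ == "__main__":
    # "h" and "q" are deliberately left alone, as per the third bullet.
    print(drop_words_with_excluded_letters(
        ["casa", "taxi", "hotel", "yogurt", "quadro"]
    ))
```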

The only question I have is what to do about the diacritics. Some proposals (from a naive English speaker's perspective - please correct any misunderstanding I may have):

  1. Normalize all diacritics (convert ò -> o, à -> a, é + è -> e, ì -> i, ù -> u) as they are legitimate letters in the Italian dictionary, and players will be able to understand that, e.g. the word caffe in Lexica is actually referring to the word caffè in the Italian language.
  2. Remove all words containing diacritics (e.g. if they are indicative of loan words that players would understand are not needed for a game such as Lexica).
  3. Remove some and normalize other diacritics. This would need input from native speakers as to: Which diacritics only ever appear in loanwords (e.g. ù is only found in 42 words in this dictionary) vs others which are used in Italian words (e.g. ò which appears in 7971 words in this dictionary).
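
To make option 1 concrete, here is a small sketch of what normalizing could mean. It is an assumption about one possible approach, not how Lexica does it: the words are decomposed with Unicode NFD and the combining accent marks are dropped, which covers the five conversions listed in option 1. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word):
    """Return the word with accents removed, e.g. "caffè" becomes "caffe"."""
    # NFD splits "è" into "e" plus a combining grave accent,
    # and the combining mark is then filtered out.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_dictionary(words):
    """Strip accents from every word and drop the duplicates this creates."""
    seen = set()
    result = []
    for word in words:
        plain = strip_diacritics(word)
        if plain not in seen:
            seen.add(plain)
            result.append(plain)
    return result


if __name__ == "__main__":
    print(normalize_dictionary(["e", "è", "città", "caffè"]))
```

One thing this makes visible is that normalizing merges distinct words ("e" and "è" both end up as "e"), so the word count of the dictionary would shrink slightly.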

pserwylo • Nov 27 '21 23:11

> Leave the uncommon letters h and q (as with English and other languages, they will be assigned a low probability, and thus appear less frequently in boards anyway).

If so, make sure that:

  • in Italian words, "h" appears between "c" or "g" and "i" or "e", forming "chi", "che", "ghi", "ghe". Some loanwords used in Italian may follow other rules (e.g. "hotel")
  • "q" is always followed by "u" ("qu"), as the two always appear together
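
The two rules above could be turned into a quick check that flags dictionary words for manual review. This is only a sketch with a made-up function name, and the accented vowels after "h" are an added assumption so that words like "perché" pass. It should be treated as a review aid rather than a filter, since it would also flag forms such as "ho" and "hai", not just loanwords like "hotel".

```python
import re

# "q" that is not immediately followed by "u".
Q_WITHOUT_U = re.compile(r"q(?!u)")

# "h" that is not preceded by "c"/"g", or not followed by "e"/"i"
# (accented variants included as an assumption).
H_OUTSIDE_DIGRAPH = re.compile(r"(?<![cg])h|h(?![eiéèì])")


def breaks_letter_rules(word):
    """Return True if the word uses "h" or "q" outside the patterns above."""
    return bool(Q_WITHOUT_U.search(word) or H_OUTSIDE_DIGRAPH.search(word))


if __name__ == "__main__":
    for word in ["chiesa", "ghetto", "quadro", "hotel"]:
        print(word, breaks_letter_rules(word))
```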

> Normalize all diacritics (convert ò -> o, à -> a, é + è -> e, ì -> i, ù -> u) as they are legitimate letters in the Italian dictionary, and players will be able to understand that, e.g. the word caffe in Lexica is actually referring to the word caffè in the Italian language.

+1. Words must appear with the correct accent in the lower part of the screen and at the end of the game.
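
One way to show the correct accent while matching on unaccented letters is to keep a lookup from the unaccented form back to the original spellings. The sketch below is just an illustration of that idea, with hypothetical names, and says nothing about how Lexica stores its words.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word):
    """Return the word with accents removed, e.g. "caffè" becomes "caffe"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_display_index(words):
    """Map each unaccented form to the original spellings it came from."""
    index = defaultdict(list)
    for word in words:
        index[strip_diacritics(word)].append(word)
    return dict(index)


if __name__ == "__main__":
    index = build_display_index(["caffè", "e", "è"])
    # The board is matched against the keys; the values are what gets shown.
    print(index["caffe"])
    print(index["e"])
```

When one key maps to several spellings (as with "e" and "è"), the game would have to decide whether to show all of them or pick one.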

airon90 • Nov 29 '21 22:11

I can't find any help/tutorial, but this information must be made public.

airon90 • Nov 29 '21 22:11

Closing this as a recent PR #362 addresses this somewhat. When going through old issues, I was attempting to fix #337 which is technically a dupe of this, but I found that one first.

If we wish to have a no-diacritics version, we can open a new issue or track it in #337.

pserwylo • Oct 06 '23 11:10