feat(funbox): added support for cyrillic and arabic charset (@m4dd0c)

Open m4dd0c opened this issue 8 months ago • 29 comments

Description

Added Arabic & Russian (cyrillic) Funboxes

Added two new funboxes Arabic & Russian with gibberish word generators. Also added logic to automatically force the Arabic language if the Arabic funbox is active to prevent config issues.

Changes
  • Added getArabic() and getRussian() utility functions.
  • Integrated both into funbox-functions.ts and list.ts as Metadata.
  • Forced Arabic language setting in setLanguage() when needed.
  • Updated FunboxName types.

Closes #6181

Let me know if any changes are required.

m4dd0c avatar Apr 23 '25 19:04 m4dd0c

Continuous integration check(s) failed. Please review the failing check's logs and make the necessary changes.

github-actions[bot] avatar Apr 23 '25 19:04 github-actions[bot]

Also, about the names: "arabic" feels too generic and misses the random-word vibe, unlike gibberish and the others. I propose "Fawda" (فوضي, Arabic for "mess") or "Harfiyya" (حرفية, from "letter"). Both are catchy and hint at the playful chaos. It would be much better if the name were written as an Arabic word, but I think that isn't an option. I don't know about Russian, but if this is agreed, I think it should be changed too. Nice work though!

Before all of that also wait for @Miodec to approve everything.

byseif21 avatar Apr 24 '25 00:04 byseif21

Also, an idea: since those are nearly gibberish again, instead of making them new funboxes, shouldn't we just make the gibberish mode support different languages? For example, selecting the gibberish funbox with English would generate Latin letters, selecting it with Arabic would generate and show Arabic letters, and so on.

byseif21 avatar Apr 24 '25 01:04 byseif21

I was thinking the same

fehmer avatar Apr 24 '25 07:04 fehmer

Great idea, actually. I'd love to work on it. @fehmer supports the idea too, so I believe we are good to go.

I wonder what's going to happen to the current issue #6181 and PR #6488.

m4dd0c avatar Apr 24 '25 07:04 m4dd0c

Yeah, that's what I originally had in mind: not adding new funboxes but modifying existing ones based on the active language.

Miodec avatar Apr 24 '25 11:04 Miodec

since those are nearly gibberish again

@byseif21 Can you specify what exactly you mean by this?

m4dd0c avatar Apr 24 '25 17:04 m4dd0c

What I meant, and what I think they meant too, is that instead of making more separate gibberish modes with just different language names, we could make the gibberish mode itself support different languages. This applies not only to this specific mode but to any other mode that relies on the same approach (needing a character list).

For now, I think we could make a folder with files containing the letters for each script or language (Arabic, Latin, Cyrillic, etc.) and a script that works with the funbox modes when they are on. When the language is switched, it would switch the letter list it generates from, and generate only if needed. I don't know if we would need to refactor things to do this in the best way, or if there's a better idea. I'm just throwing out ideas right now; if I touched that code I might think differently.

@Miodec @fehmer , do you have a specific approach in your mind to do that in the best possible way?

byseif21 avatar Apr 24 '25 18:04 byseif21

the generator for gibberish and ascii are using codepoints for the letters, maybe we can use the same approach for arabic and cyrillic letters. I don't know anything about the arabic or cyrillic letters and which are used in which language. I found codepoints for cyrillic https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode and arabic https://en.wikipedia.org/wiki/Arabic_script_in_Unicode so maybe we can use a subset of them?

If the codepoint range will not work or gets too messy, we could create something like

type Charset = "latin"|"arabic"|"cyrillic";

const lettersByCharset: Record<Charset,string[]> = {
	latin: [...],
	arabic: [...],
	cyrillic: [...],
}

and add a parameter of type Charset to the gibberish and ascii generator.

On the funbox side we need a mapping between language and charset: either fixed rules, or we add a charset property to the language groups. Then we map the current language Config.language to a charset and call the gibberish/ascii generator with this charset.

fehmer avatar Apr 24 '25 19:04 fehmer
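
A minimal sketch of the proposal above, assuming explicit letter lists and a fixed-rule lookup. Only `Charset` and `lettersByCharset` come from the comment; `charsetByLanguage`, `charsetForLanguage`, the injectable `random` parameter and the letter lists themselves are hypothetical and not taken from the monkeytype codebase.

```typescript
type Charset = "latin" | "arabic" | "cyrillic";

// Illustrative letter lists (assumption): basic a-z, the 28 Arabic letters,
// and the 32 Russian lowercase letters without "ё".
const lettersByCharset: Record<Charset, string[]> = {
  latin: "abcdefghijklmnopqrstuvwxyz".split(""),
  arabic: "ابتثجحخدذرزسشصضطظعغفقكلمنهوي".split(""),
  cyrillic: "абвгдежзийклмнопрстуфхцчшщъыьэюя".split(""),
};

// Hypothetical fixed rules: base language name -> charset.
const charsetByLanguage: Record<string, Charset> = {
  arabic: "arabic",
  russian: "cyrillic",
};

// Maps e.g. "Arabic_Egypt" to "arabic"; unknown languages fall back to latin.
function charsetForLanguage(language: string): Charset {
  const base = (language.split("_")[0] || "").toLowerCase();
  return charsetByLanguage[base] || "latin";
}

// Builds one gibberish word of the given length from the chosen charset.
function getGibberish(
  charset: Charset,
  length: number,
  random: () => number = Math.random
): string {
  const letters = lettersByCharset[charset];
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(random() * letters.length)];
  }
  return word;
}
```

The funbox would then call something like `getGibberish(charsetForLanguage(Config.language), 5)`; passing `random` explicitly only makes the function easy to test.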

Okaaay that sounds great here. totally got it.

@m4dd0c Would you need help with that? If you run into any trouble, I can try to help with the Arabic part if you want!

byseif21 avatar Apr 24 '25 19:04 byseif21

I'm currently doing it, but in a more granular way: basically creating a list of the charset used in each language, e.g.,

const charsets = {
  spanish: alterLatin({ lettersToAdd: ["ñ", "Ñ"] }),
  german: alterLatin({ lettersToAdd: ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"] }),
  ...
}

This charset will be used in gibberish funbox.

Thoughts?

PS: I am creating a discussion for further discussion.

m4dd0c avatar Apr 24 '25 20:04 m4dd0c

For consistency, I think we should first go with the codepoint approach. If that doesn't work for any reason, we can think of another approach. What do you think?

byseif21 avatar Apr 24 '25 20:04 byseif21

Okay, I'll start working on it.

m4dd0c avatar Apr 24 '25 21:04 m4dd0c

If we want to be lazy we could extract the letters from the language json files

fehmer avatar Apr 24 '25 21:04 fehmer

I don't think the codepoint approach is going to work out for us.

If we want to support every available language, then each language group must have an associated alphabet/charset.

Many of the languages have their own distinct charset.

E.g., Italian

// It will remove and add these letters to latin.
italian: alterLatin({
	lettersToAdd: ['x', 'k'],
	lettersToRemove: ['q', 'z']
});

Since Unicode codepoints are sequential, I don't think it would be a good idea to use them as ranges, keeping in mind that a language can also have bonus letters/characters (e.g., Ñ / ñ) that fall outside the basic range.

The best approach, in my opinion, is to have a charset/alphabet for each language group.

m4dd0c avatar Apr 25 '25 04:04 m4dd0c
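
For illustration, `alterLatin` could look roughly like the sketch below. The helper is only named in the comment above and its real signature is not shown in this thread, so the option names are taken from the snippets and everything else is an assumption.

```typescript
// Base alphabet that every per-language charset starts from.
const LATIN: string[] = "abcdefghijklmnopqrstuvwxyz".split("");

interface AlterOptions {
  lettersToAdd?: string[];
  lettersToRemove?: string[];
}

// Returns a copy of the Latin alphabet with some letters removed and others appended.
function alterLatin(options: AlterOptions): string[] {
  const toAdd = options.lettersToAdd || [];
  const toRemove = options.lettersToRemove || [];
  const result = LATIN.filter((letter) => toRemove.indexOf(letter) === -1);
  for (const letter of toAdd) {
    // Skip letters that are already present so the list stays duplicate-free.
    if (result.indexOf(letter) === -1) {
      result.push(letter);
    }
  }
  return result;
}

const spanish = alterLatin({ lettersToAdd: ["ñ", "Ñ"] });
```

The cost of this design is the one raised in the thread: every supported language needs its own hand-maintained entry.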

@m4dd0c I agree that using code points will not be good if we want per-language support.

Defining the charset based on the language group will not work either. For example, in "tamil" there are "tamil" and "tanglish" with different charsets. As far as I remember, some languages we support have both a native-script and a Latin version.

I have to talk with @Miodec about this, but I think the best idea would be to add a charset property of type string[] to each language file. The file is loaded anyway when you switch languages, and the gibberish funbox can just use it.

If someone wants to add a new language, they have to provide the charset in the language file (checked by the validation) instead of needing to touch yet another file somewhere in the source code.

I would write a script to extract the used characters from each language file and add the property initially. This way we don't need to investigate which characters should be used for each language.

fehmer avatar Apr 25 '25 05:04 fehmer
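
The extraction script mentioned above could be as small as the following sketch. It assumes a language file is an object with a `words` array of strings; that shape and the names `LanguageFile`, `extractCharset` and `withCharset` are assumptions for illustration, not the project's actual schema or tooling.

```typescript
// Assumed shape of a language JSON file, with the proposed charset property.
interface LanguageFile {
  name: string;
  words: string[];
  charset?: string[];
}

// Collects every distinct non-whitespace character used in the word list.
function extractCharset(words: string[]): string[] {
  const seen: Record<string, boolean> = {};
  for (const word of words) {
    for (const ch of word.split("")) {
      if (ch.trim() !== "") {
        seen[ch] = true;
      }
    }
  }
  // Sorted by UTF-16 code unit so the output is stable between runs.
  return Object.keys(seen).sort();
}

// Returns a copy of the language file with the charset property filled in.
function withCharset(language: LanguageFile): LanguageFile {
  return {
    name: language.name,
    words: language.words,
    charset: extractCharset(language.words),
  };
}
```

Splitting on UTF-16 code units is enough for scripts in the Basic Multilingual Plane; characters outside it would need code-point-aware iteration.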

Thanks for the work guys, we had a chat, decided on the following approach:

We will only modify the gibberish funbox. Let's add 3 charsets (latin, arabic and cyrillic) to the generate.ts file. Add an optional property called charset to the JSON language files, which can be arabic or cyrillic (latin is the default, so we won't bother adding it everywhere). Then, when we generate gibberish, take the charset prop from the language file (defaulting to latin if not found) and generate the strings.

Miodec avatar Apr 25 '25 08:04 Miodec
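
A rough sketch of the approach decided above, assuming each charset is one inclusive code point range. The range values are the ones quoted elsewhere in this thread; `LanguageObject`, `charsetRanges` and the `getGibberish` signature shown here are hypothetical and may differ from what generate.ts actually contains.

```typescript
type Charset = "latin" | "arabic" | "cyrillic";

// Assumed minimal view of a loaded language file; charset is optional.
interface LanguageObject {
  name: string;
  charset?: Charset;
}

// Inclusive [start, end] code point ranges per charset.
const charsetRanges: Record<Charset, [number, number]> = {
  latin: [97, 122], // a to z
  arabic: [1569, 1610], // U+0621 to U+064A
  cyrillic: [1072, 1103], // U+0430 to U+044F
};

// Generates one gibberish word, falling back to latin when charset is not set.
function getGibberish(
  language: LanguageObject,
  length: number,
  random: () => number = Math.random
): string {
  const range = charsetRanges[language.charset || "latin"];
  const size = range[1] - range[0] + 1;
  let word = "";
  for (let i = 0; i < length; i++) {
    word += String.fromCharCode(range[0] + Math.floor(random() * size));
  }
  return word;
}
```

Because latin is the fallback, existing language files need no change; only Arabic-script and Cyrillic-script files would gain the new property.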

Quick question: Hindi, Sanskrit, Marathi, and Nepali use the Devanagari charset. Should I add that as well?

m4dd0c avatar Apr 25 '25 17:04 m4dd0c

Yes

Miodec avatar Apr 25 '25 18:04 Miodec

Hi @m4dd0c Very nice work! Just a little curious - did you test if the Arabic is shown with connected or separated characters here?

byseif21 avatar Apr 27 '25 13:04 byseif21

Thank you @byseif21. I don't understand Arabic. Can you check and tell me?

m4dd0c avatar Apr 27 '25 16:04 m4dd0c

I checked, and the letters are connected as they should be. However, I noticed another thing: the Arabic range includes the extended Arabic letters, which aren't typeable in the Arabic languages (arabic, Arabic_Egypt) because those use the standard characters only. The extended letters may be used in other languages (e.g., Persian, Kurdish, Urdu), so I think we should split the range into at least two: a standard one for the language and a general one, e.g.

    standard_arabic: [                                        // without the extended letters
      { start: 1569, end: 1594 }, // U+0621–U+063A (ء to غ)
      { start: 1601, end: 1608 }, // U+0641–U+0648 (ف to و)
      { start: 1610, end: 1610 }  // U+064A (ي)
    ],
    arabic: {                                                 // with the extended letters
      start: 1569, // ء (U+0621)
      end: 1610    // ي (U+064A)
    },

I think some other languages may have the same problem, so it would be better to either take the same approach (one range without the extended letters and one with them) or define a range for each language separately. We can leave the rest and wait for a native user to set them correctly. Alternatively, we can leave the ranges general as they are and wait for a native user of each language to notice; if they feel the need to change or fix the range, they can do it themselves or open an issue. I'm not sure.

byseif21 avatar Apr 27 '25 20:04 byseif21
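
To see what the split above changes in practice, here is a small sketch that expands range lists into characters. `CodePointRange` and `expandRanges` are hypothetical helpers; the numbers are copied from the comment above.

```typescript
interface CodePointRange {
  start: number;
  end: number;
}

// Expands inclusive code point ranges into a flat list of characters.
function expandRanges(ranges: CodePointRange[]): string[] {
  const chars: string[] = [];
  for (const range of ranges) {
    for (let cp = range.start; cp <= range.end; cp++) {
      chars.push(String.fromCharCode(cp));
    }
  }
  return chars;
}

const standardArabic: CodePointRange[] = [
  { start: 1569, end: 1594 }, // U+0621 to U+063A
  { start: 1601, end: 1608 }, // U+0641 to U+0648
  { start: 1610, end: 1610 }, // U+064A
];
const arabic: CodePointRange[] = [{ start: 1569, end: 1610 }];

console.log(expandRanges(standardArabic).length); // 35
console.log(expandRanges(arabic).length); // 42
```

The seven code points that only the single range covers are U+063B to U+0640 and U+0649.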

I intentionally kept the charset range minimal, since the wider ranges contain plenty of extra characters that may cause issues while typing, e.g., blank characters, unsupported characters, and letters that are not used in any language.

Furthermore, if I go with the approach you have mentioned, I wonder how someone would select between the standard and non-standard versions for different languages.

What I have in mind is, We can have 2 funboxes,

  1. Gibberish Standard
  2. Gibberish Extended

Thoughts?

m4dd0c avatar Apr 28 '25 04:04 m4dd0c

I think Mio's idea was to not overcomplicate this feature. https://github.com/monkeytypegame/monkeytype/pull/6488#issuecomment-2829756423

With latin we use the most basic alphabet, a-z. For Italian, for example, we would have unused letters (like j and k) and would be missing letters (like è).

Can we do the same for the other languages, find a minimal, common set which should be typeable?

fehmer avatar Apr 28 '25 04:04 fehmer

This PR currently supports only a minimal, common range, as @Miodec suggested.

m4dd0c avatar Apr 28 '25 06:04 m4dd0c

@m4dd0c No one would have to select anything, and we don't need two different modes. The languages would just use different charset names in their language files. For example, the Arabic languages with standard letters (arabic, Arabic_Egypt) would set charset: standard_arabic, and the languages that need the extended letters would set charset: arabic. Just like that.

byseif21 avatar Apr 28 '25 08:04 byseif21

@fehmer Okay, a minimal, common set sounds like the best solution so we don't overcomplicate things any further. I did some searching, and these may be the best ranges we can go with for now, all without the extended letters, rare characters, or problematic symbols:

arabic: [
      { start: 1569, end: 1594 }, // U+0621–U+063A (ء to غ)
      { start: 1601, end: 1608 }, // U+0641–U+0648 (ف to و)
      { start: 1610,  end: 1610}  // U+064A (ي)
    ],
    latin: {
      start: 97,  // a (U+0061)
      end: 122    // z (U+007A)
    },
    cyrillic: {
      start: 1072, // а (U+0430)
      end: 1103    // я (U+044F)
    },
    devanagari: [
      { start: 2309, end: 2361 }, // U+0905–U+0939 (अ to ह)
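      { start: 2366, end: 2376 }  // U+093E–U+0948 (vowel signs ा to ै)
    ],
    geez: [
      { start: 4768, end: 4960 } // U+12A0–U+1360 as written; ሀ to ፟ (U+1200–U+135F) would be 4608–4959
    ],
    tamil: [
      { start: 2949, end: 3020 }, // U+0B85–U+0BCC (அ to ௌ, so this already includes the vowel signs)
      { start: 3006, end: 3028 }  // U+0BBE–U+0BD4 as written; the vowel signs ா to ௌ (U+0BBE–U+0BCC) would be 3006–3020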
    ],
    telugu: [
      { start: 3077, end: 3148 }, // U+0C05–U+0C4C (అ to ౌ)
      { start: 3158, end: 3160 }  // U+0C56–U+0C58 (additional vowels ౖ to ౘ)
    ],
    bengali: [
      { start: 2437, end: 2489 }, // U+0985–U+09B9 (অ to হ)
      { start: 2494, end: 2508 }  // U+09BE–U+09CC (vowel signs া to ৌ)
    ],
    malayalam: [
      { start: 3333, end: 3396 }, // U+0D05–U+0D44 (അ to ൄ, letters plus the vowel signs ാ to ൄ)
      { start: 3398, end: 3404 }  // U+0D46–U+0D4C (vowel signs െ to ൌ)
    ],
    kannada: [
      { start: 3205, end: 3268 }, // U+0C85–U+0CC4 (ಅ to ೄ, letters plus the vowel signs ಾ to ೄ)
      { start: 3270, end: 3276 }  // U+0CC6–U+0CCC (vowel signs ೆ to ೌ)
    ],
    burmese: [
      { start: 4096, end: 4138 } // U+1000–U+102A (က to ဪ)
    ],
    tibetan: [
      { start: 3904, end: 3911 } // U+0F40–U+0F47 (ཀ to ཇ)
    ],
    sinhala: [
      { start: 3461, end: 3516 }, // U+0D85–U+0DBC (අ to හ)
      { start: 3535, end: 3551 }  // U+0DCF–U+0DDF (vowel signs ා to ෟ)
    ],
    hebrew: {
      start: 1488, // א (U+05D0)
      end: 1514    // ת (U+05EA)
    },
    thai: [
      { start: 3585, end: 3631 } // U+0E01–U+0E2F (ก to ฯ)
    ],
    greek: {
      start: 945,  // α (U+03B1)
      end: 969     // ω (U+03C9)
    },
    han: [
      { start: 19968, end: 27903 } // U+4E00–U+6CFF (common CJK ideographs)
    ],
    hangul: {
      start: 44032, // 가 (U+AC00)
      end: 55203    // 힣 (U+D7A3)
    },
    khmer: [
      { start: 6016, end: 6067 } // U+1780–U+17B3 (ក to ឳ)
    ],
    ol_chiki: [
      { start: 7248, end: 7293 } // U+1C50–U+1C7D (includes the digits at U+1C50–U+1C59; the letters ᱚ to ᱽ start at U+1C5A = 7258)
    ],
    hiragana: {
      start: 12353, // ぁ (U+3041)
      end: 12438    // ゖ (U+3096)
    },
    katakana: {
      start: 12449, // ァ (U+30A1)
      end: 12538    // ヺ (U+30FA)
    }
  };
 

byseif21 avatar Apr 28 '25 09:04 byseif21
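
Since the list above pairs decimal values with U+ notation in comments, a small helper can cross-check the two before such ranges are committed. This is only a sketch; `toUPlus` and `describeRange` are hypothetical names, not part of the PR.

```typescript
interface CodePointRange {
  start: number;
  end: number;
}

// Formats a code point as U+XXXX, padded to at least four hex digits.
function toUPlus(codePoint: number): string {
  const hex = codePoint.toString(16).toUpperCase();
  return "U+" + ("0000" + hex).slice(-Math.max(4, hex.length));
}

// Describes a range in U+ notation together with the number of code points it spans.
function describeRange(range: CodePointRange): string {
  const count = range.end - range.start + 1;
  return toUPlus(range.start) + ".." + toUPlus(range.end) + " (" + count + " code points)";
}

console.log(describeRange({ start: 1569, end: 1594 })); // U+0621..U+063A (26 code points)
console.log(describeRange({ start: 44032, end: 55203 })); // U+AC00..U+D7A3 (11172 code points)
```

Running each entry through `describeRange` and comparing the result with its comment makes a mismatch between the decimal and the hex value easy to spot.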

Well, I like the idea. I'll make the proposed changes, but I'm not sure what @Miodec's take on this is.

I will also update the getGibberish logic to generate combined letters (supporting matras) for certain scripts, e.g., Devanagari.

m4dd0c avatar Apr 28 '25 12:04 m4dd0c
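
One way the combined-letter idea could work for Devanagari is sketched below: pick a consonant and, some of the time, attach a vowel sign (matra) to it. The consonant range U+0915 to U+0939 and the 50% chance are assumptions for illustration; the vowel sign range is the one quoted earlier in this thread. The actual getGibberish change may look different.

```typescript
interface CodePointRange {
  start: number;
  end: number;
}

const CONSONANTS: CodePointRange = { start: 0x0915, end: 0x0939 }; // क to ह (assumed range)
const VOWEL_SIGNS: CodePointRange = { start: 2366, end: 2376 }; // U+093E to U+0948

// Picks one character uniformly from an inclusive code point range.
function pick(range: CodePointRange, random: () => number): string {
  const size = range.end - range.start + 1;
  return String.fromCharCode(range.start + Math.floor(random() * size));
}

// A syllable is a consonant, followed by a vowel sign roughly half of the time.
function getSyllable(random: () => number): string {
  const consonant = pick(CONSONANTS, random);
  return random() < 0.5 ? consonant + pick(VOWEL_SIGNS, random) : consonant;
}

// Joins several syllables into one gibberish word.
function getDevanagariGibberish(syllables: number, random: () => number = Math.random): string {
  let word = "";
  for (let i = 0; i < syllables; i++) {
    word += getSyllable(random);
  }
  return word;
}
```

Emitting a vowel sign only after a consonant avoids a matra appearing at the start of a word or directly after another matra, which a flat random pick over the whole range would allow.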

Continuous integration check(s) failed. Please review the failing check's logs and make the necessary changes.

github-actions[bot] avatar May 08 '25 18:05 github-actions[bot]