feat(funbox): added support for cyrillic and arabic charset (@m4dd0c)
Description
Added Arabic & Russian (cyrillic) Funboxes
Added two new funboxes Arabic & Russian with gibberish word generators. Also added logic to automatically force the Arabic language if the Arabic funbox is active to prevent config issues.
Changes
- Added getArabic() and getRussian() utility functions.
- Integrated both into funbox-functions.ts and list.ts as metadata.
- Forced the Arabic language setting in setLanguage() when needed.
- Updated FunboxName types.
Closes #6181
Let me know if any changes are required.
Continuous integration check(s) failed. Please review the failing check's logs and make the necessary changes.
Also, about the names: "arabic" feels too generic and misses the random-word vibe, unlike gibberish and the others. I propose "Fawda" (فوضي, Arabic for "mess") or "Harfiyya" (حرفية, from "letter"). Both are catchy and hint at the playful chaos. It would be much better if the name were an Arabic word written in Arabic, but I don't think that is an option. I don't know about Russian, but if this is agreed, I think it should be changed too.
Nice work though!
Before all of that also wait for @Miodec to approve everything.
Also, an idea: since these are nearly gibberish again, instead of making them new funboxes, shouldn't we just make the gibberish mode support different languages? For example, selecting the gibberish funbox with English would generate Latin letters, with Arabic it would generate and show Arabic letters, and so on.
I was thinking the same
Great idea, actually. I'd love to work on it. @fehmer also supports the idea, so I believe we are good to go.
I wonder what's going to happen to the current issue #6181 and PR #6488.
Yeah, that's what I originally had in mind: not adding new funboxes but modifying existing ones based on the active language.
since those are nearly gibberish again
@byseif21 Can you clarify what exactly you mean by this?
What I meant, and I think they meant too, is that instead of making more separate gibberish modes that differ only in the language name, we could make the gibberish mode itself support different languages. That applies not only to this specific mode but to any other mode that relies on the same approach (needing a character list). For now, I think we could make a folder with a file of letters for each script or language (Arabic, Latin, Cyrillic, etc.) and a script that works with the funbox modes when they are on. When the language is switched, it switches the letter list it generates from, and it generates only if needed. I don't know whether we would need to refactor things to do this in the best way, or whether there's a better idea. I'm just throwing out ideas right now. If I touched that code I might think differently.
@Miodec @fehmer, do you have a specific approach in mind for doing this in the best possible way?
The generators for gibberish and ascii use codepoints for the letters, so maybe we can use the same approach for Arabic and Cyrillic letters. I don't know anything about Arabic or Cyrillic letters or which are used in which language. I found codepoints for Cyrillic https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode and Arabic https://en.wikipedia.org/wiki/Arabic_script_in_Unicode so maybe we can use a subset of them?
If the codepoint range will not work or gets too messy, we could create something like
type Charset = "latin"|"arabic"|"cyrillic";
const lettersByCharset: Record<Charset,string[]> = {
latin: [...],
arabic: [...],
cyrillic: [...],
}
and add a parameter of type Charset to the gibberish and ascii generators.
On the funbox side we need a mapping between language and charset: either fixed rules, or we add a charset property to the language groups. Then we map the current language Config.language to a charset and call the gibberish/ascii generator with this charset.
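The proposal above could be sketched roughly as follows. This is only an illustration: the letter lists stand in for the elided `[...]` entries, and `charsetByLanguage`, `charsetForLanguage`, and the `getGibberish` signature are hypothetical names, not the project's actual code.

```typescript
// Hypothetical sketch: letter lists and helper names are illustrative only.
type Charset = "latin" | "arabic" | "cyrillic";

const lettersByCharset: Record<Charset, string[]> = {
  latin: Array.from("abcdefghijklmnopqrstuvwxyz"),
  arabic: Array.from("ابتثجحخدذرزسشصضطظعغفقكلمنهوي"),
  cyrillic: Array.from("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
};

// Assumed "fixed rules" mapping; any language not listed falls back to latin.
const charsetByLanguage: Record<string, Charset> = {
  arabic: "arabic",
  russian: "cyrillic",
};

function charsetForLanguage(language: string): Charset {
  return charsetByLanguage[language] ?? "latin";
}

// Builds a word of 1 to 7 random letters from the given charset.
// `random` is injectable so the behaviour can be tested deterministically.
function getGibberish(
  charset: Charset,
  random: () => number = Math.random
): string {
  const letters = lettersByCharset[charset];
  const length = 1 + Math.floor(random() * 7);
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(random() * letters.length)];
  }
  return word;
}
```

A funbox would then call something like `getGibberish(charsetForLanguage(Config.language))`, where `Config.language` is the setting mentioned in the comment.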
Okay, that sounds great. Totally got it.
@m4dd0c would you need help with that? If you run into trouble or anything, I can try to help with the Arabic part if you want!
I'm currently doing it, but in a more granular way: basically creating a list of the characters used in each language, e.g.,
const charsets = {
spanish: alterLatin({ lettersToAdd: ["ñ", "Ñ"] }),
german: alterLatin({ lettersToAdd: ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"] }),
...
}
These charsets will be used in the gibberish funbox.
Thoughts?
PS: I am creating a discussion for further discussion.
For consistency, I think we should go with the codepoint approach first. If that doesn't work for any reason, we can think of another approach. What do you think?
Okay, I'll start working on it.
If we want to be lazy, we could extract the letters from the language JSON files.
I don't think the codepoint approach is going to work out for us.
If we want to support each available language, then we must have an associated alphabet/charset for each language group.
Many of the languages have their own charset.
E.g., Italian:
// It will remove and add these letters to latin.
italian: alterLatin({
lettersToAdd: ['x', 'k'],
lettersToRemove: ['q', 'z']
});
Since the Unicode codepoints are sequential, I don't think it would be a good idea to use them, keeping in mind that a language can also have extra letters or characters (e.g., Ñ / ñ) outside the basic range.
The best approach, in my opinion, is to have a charset/alphabet for each language group.
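The `alterLatin` helper used in these examples is not shown in the thread. A minimal sketch of what it might look like, assuming it starts from the basic a to z alphabet and takes the two optional lists seen above (the `AlterOptions` name and the exact behaviour are assumptions):

```typescript
// Hypothetical helper: the thread names alterLatin but never shows its body.
interface AlterOptions {
  lettersToAdd?: string[];
  lettersToRemove?: string[];
}

const LATIN: readonly string[] = Array.from("abcdefghijklmnopqrstuvwxyz");

function alterLatin({
  lettersToAdd = [],
  lettersToRemove = [],
}: AlterOptions = {}): string[] {
  const removed = new Set(lettersToRemove);
  const kept = LATIN.filter((letter) => !removed.has(letter));
  // A Set drops duplicates in case an added letter is already present.
  return Array.from(new Set([...kept, ...lettersToAdd]));
}

const charsets = {
  spanish: alterLatin({ lettersToAdd: ["ñ", "Ñ"] }),
  german: alterLatin({ lettersToAdd: ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"] }),
};
```

Removal is applied before addition here, so a letter listed in both options ends up in the result. That ordering is a design choice of this sketch, not something the thread specifies.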
@m4dd0c I agree that using code points will not be good if we want per-language support.
Defining the charset based on the language group will also not work. For example, in "tamil" there are "tamil" and "tanglish" with different charsets. As far as I remember, some languages we support have both a native-script and a Latin version.
I have to talk with @Miodec about this, but I think the best idea would be to add a charset property of type string[] to each language file. The file is loaded anyway when you switch languages, and the gibberish funbox can just use it.
If someone wants to add a new language, they have to provide the charset in the language file (checked by the validation) instead of needing to touch yet another file somewhere in the source code.
I would write a script to extract the used characters from each language file and add the property initially. This way we don't need to investigate which characters should be used for each language.
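The extraction step could look something like the sketch below. It assumes a language file exposes a `words` array of strings; the `LanguageFile` interface and `extractCharset` name are made up for illustration, and a real script would also need to read and write the JSON files.

```typescript
// Assumed shape of a language file; only `words` is used by this sketch.
interface LanguageFile {
  name: string;
  words: string[];
}

// Collects every distinct non-whitespace character used in the word list.
function extractCharset(language: LanguageFile): string[] {
  const seen = new Set<string>();
  for (const word of language.words) {
    // Array.from iterates by code point, so characters outside the BMP
    // are not split into surrogate halves.
    for (const ch of Array.from(word)) {
      if (ch.trim() !== "") seen.add(ch);
    }
  }
  return Array.from(seen).sort();
}

const sample: LanguageFile = { name: "sample", words: ["abc", "cab d"] };
console.log(extractCharset(sample).join(""));
```

Note that this works per code point, so combining marks (for example vowel signs in Indic scripts) would show up as separate entries rather than as part of a syllable.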
Thanks for the work, guys. We had a chat and decided on the following approach:
We will only modify the gibberish funbox.
Let's add 3 charsets: latin, arabic and cyrillic to the generate.ts file.
Add an optional property called charset to the JSON language files, which can be arabic or cyrillic (latin is the default, so we won't bother adding it everywhere).
Then, when we generate gibberish, take the charset prop from the language file (default to latin if not found) and generate the strings.
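A rough sketch of the agreed approach, to make the default-to-latin step concrete. The type and function names here (`CharsetName`, `LanguageObject`, `resolveCharset`, `gibberishWord`) are assumptions, and the ranges are illustrative rather than the values that ended up in generate.ts.

```typescript
type CharsetName = "latin" | "arabic" | "cyrillic";

// Assumed subset of a language file; only `charset` matters for this sketch.
interface LanguageObject {
  name: string;
  charset?: CharsetName;
}

// Inclusive code point ranges (illustrative values).
const charsetRanges: Record<CharsetName, { start: number; end: number }> = {
  latin: { start: 0x61, end: 0x7a }, // a to z
  arabic: { start: 0x621, end: 0x64a }, // ء to ي
  cyrillic: { start: 0x430, end: 0x44f }, // а to я
};

// Languages without a charset property are treated as latin.
function resolveCharset(language: LanguageObject): CharsetName {
  return language.charset ?? "latin";
}

function gibberishWord(
  language: LanguageObject,
  length: number,
  random: () => number = Math.random
): string {
  const { start, end } = charsetRanges[resolveCharset(language)];
  let word = "";
  for (let i = 0; i < length; i++) {
    const offset = Math.floor(random() * (end - start + 1));
    word += String.fromCodePoint(start + offset);
  }
  return word;
}
```

Because the property is optional, existing Latin-script language files need no change, which matches the intent stated in the comment.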
Quick question: Hindi, Sanskrit, Marathi, and Nepali use the devanagari charset. Should I add that as well?
Yes
Hi @m4dd0c Very nice work! Just a little curious - did you test if the Arabic is shown with connected or separated characters here?
Thank you @byseif21. I don't understand Arabic. Can you check and tell me?
I checked and they connect properly, as they should. However, I noticed another thing: the Arabic range includes the extended Arabic letters, which aren't typeable in the Arabic languages (arabic, Arabic_Egypt) because those use the standard characters only. The extended letters may be used in other languages (e.g., Persian, Kurdish, Urdu), so I think we should split the range into at least two, one standard range for the language and one general range, e.g.
standard_arabic: [ // this one without the extended letters
{ start: 1569, end: 1594 }, // U+0621–U+063A (ء to غ)
{ start: 1601, end: 1608 }, // U+0641–U+0648 (ف to و)
{ start: 1610, end: 1610 } // U+064A (ي)
],
arabic: { // this one with the extended letters
start: 1569, // ء (U+0621)
end: 1610 // ي (U+064A)
},
I think some other languages may have this problem too, so taking the same approach (one range without the extended letters and one with them) or defining a range for each language separately would be better, I guess. We could just leave the rest and wait for a native user to set them correctly. Or we could leave them general like that and wait for a native user of each language to notice and, if they feel the need to change or fix the range, do it themselves or open an issue. I don't know.
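Splitting a charset into several ranges means the generator has to pick across all of them. One way to do that, sketched under the assumption that every charset is stored as a list of inclusive ranges (`CodePointRange`, `rangesByCharset`, and `randomCharFrom` are illustrative names):

```typescript
interface CodePointRange {
  start: number; // inclusive
  end: number; // inclusive
}

const rangesByCharset: Record<"standard_arabic" | "arabic", CodePointRange[]> = {
  standard_arabic: [
    { start: 0x0621, end: 0x063a },
    { start: 0x0641, end: 0x0648 },
    { start: 0x064a, end: 0x064a },
  ],
  arabic: [{ start: 0x0621, end: 0x064a }],
};

// Picks one character uniformly over all code points in the ranges,
// so a small range is not over-represented compared to a large one.
function randomCharFrom(
  ranges: CodePointRange[],
  random: () => number = Math.random
): string {
  const total = ranges.reduce((sum, r) => sum + (r.end - r.start + 1), 0);
  let offset = Math.floor(random() * total);
  for (const range of ranges) {
    const size = range.end - range.start + 1;
    if (offset < size) return String.fromCodePoint(range.start + offset);
    offset -= size;
  }
  throw new Error("ranges must not be empty");
}
```

With these values the three standard_arabic ranges cover 35 code points, while the single arabic range covers 42. The difference is U+063B to U+0640 and U+0649.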
I intentionally kept the charset range minimal, since there are plenty of extra characters that may cause issues while typing, e.g., blank characters, unsupported characters, and letters that are not used in any language.
Furthermore, if I go with the approach you have mentioned, I wonder how someone would select between the standard and non-standard versions for different languages.
What I have in mind is that we can have 2 funboxes:
- Gibberish Standard
- Gibberish Extended
Thoughts?
I think Mio's idea was to not overcomplicate this feature. https://github.com/monkeytypegame/monkeytype/pull/6488#issuecomment-2829756423
With latin we use the most basic alphabet, a-z. For Italian, for example, that leaves us with unused letters (like j and k) and missing letters (like è).
Can we do the same for the other languages and find a minimal, common set that should be typeable?
This PR currently supports only the minimal, common set range, as @Miodec suggested.
No one will select anything, and we don't have to make two different modes. The languages will just use different charset names in their language files. For example, the Arabic languages with standard letters (arabic, Arabic_Egypt) will set charset: standard_arabic
and the languages that need the extended letters will set charset: arabic. Just like that.
Okay, that sounds like the best solution to avoid overcomplicating things further. I did some searching, and these may be the best ranges we can go with for now, all without the extended letters, rare characters, or problematic symbols:
arabic: [
{ start: 1569, end: 1594 }, // U+0621–U+063A (ء to غ)
{ start: 1601, end: 1608 }, // U+0641–U+0648 (ف to و)
{ start: 1610, end: 1610} // U+064A (ي)
],
latin: {
start: 97, // a (U+0061)
end: 122 // z (U+007A)
},
cyrillic: {
start: 1072, // а (U+0430)
end: 1103 // я (U+044F)
},
devanagari: [
{ start: 2309, end: 2361 }, // U+0905–U+0939 (अ to ह)
{ start: 2366, end: 2376 } // U+093E–U+0948 (vowel signs आ to ऐ)
],
geez: [
{ start: 4768, end: 4960 } // U+1200–U+135F (ሀ to ፟)
],
tamil: [
{ start: 2949, end: 3020 }, // U+0B85–U+0BBC (அ to ஔ)
{ start: 3006, end: 3028 } // U+0BBE–U+0BCC (vowel signs ா to ௌ)
],
telugu: [
{ start: 3077, end: 3148 }, // U+0C05–U+0C4C (అ to ౌ)
{ start: 3158, end: 3160 } // U+0C56–U+0C58 (additional vowels ౖ to ౘ)
],
bengali: [
{ start: 2437, end: 2489 }, // U+0985–U+09B9 (অ to হ)
{ start: 2494, end: 2508 } // U+09BE–U+09CC (vowel signs া to ৌ)
],
malayalam: [
{ start: 3333, end: 3396 }, // U+0D05–U+0D3C (അ to ഹ)
{ start: 3398, end: 3404 } // U+0D3E–U+0D44 (vowel signs ാ to ൄ)
],
kannada: [
{ start: 3205, end: 3268 }, // U+0C85–U+0CBC (ಅ to ಹ)
{ start: 3270, end: 3276 } // U+0CBE–U+0CC4 (vowel signs ಾ to ೄ)
],
burmese: [
{ start: 4096, end: 4138 } // U+1000–U+102A (က to ဪ)
],
tibetan: [
{ start: 3904, end: 3911 } // U+0F40–U+0F47 (ཀ to ཧ)
],
sinhala: [
{ start: 3461, end: 3516 }, // U+0D85–U+0DBC (අ to හ)
{ start: 3535, end: 3551 } // U+0DCF–U+0DDF (vowel signs to ෟ)
],
hebrew: {
start: 1488, // א (U+05D0)
end: 1514 // ת (U+05EA)
},
thai: [
{ start: 3585, end: 3631 } // U+0E01–U+0E2F (ก to ๏)
],
greek: {
start: 945, // α (U+03B1)
end: 969 // ω (U+03C9)
},
han: [
{ start: 19968, end: 27903 } // U+4E00–U+6CAF (common CJK ideographs)
],
hangul: {
start: 44032, // 가 (U+AC00)
end: 55203 // 힣 (U+D7A3)
},
khmer: [
{ start: 6016, end: 6067 } // U+1780–U+17B3 (ក to ឳ)
],
ol_chiki: [
{ start: 7248, end: 7293 } // U+1C5A–U+1C7D (ᱚ to ᱽ)
],
hiragana: {
start: 12353, // あ (U+3041)
end: 12438 // ん (U+3096)
},
katakana: {
start: 12449, // ア (U+30A1)
end: 12538 // ン (U+30FA)
}
};
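Several entries in the list above appear to have decimal bounds that do not match the hexadecimal code points in their comments (for instance, 4768 is U+12A0, not U+1200). Which side is the intended one isn't clear from the thread, so the list is left as posted. A small check like the one below could flag such entries before they are merged; `toUPlus`, `samples`, and `findMismatches` are hypothetical names, and only a few rows are copied in as examples.

```typescript
// Formats a code point as U+XXXX (at least four hex digits).
function toUPlus(codePoint: number): string {
  const hex = codePoint.toString(16).toUpperCase();
  return "U+" + (hex.length >= 4 ? hex : ("0000" + hex).slice(-4));
}

interface RangeRow {
  name: string;
  start: number;
  end: number;
  comment: string; // the "U+XXXX–U+YYYY" text written next to the range
}

// A few rows copied from the list above.
const samples: RangeRow[] = [
  { name: "arabic", start: 1569, end: 1594, comment: "U+0621–U+063A" },
  { name: "geez", start: 4768, end: 4960, comment: "U+1200–U+135F" },
  { name: "han", start: 19968, end: 27903, comment: "U+4E00–U+6CAF" },
  { name: "ol_chiki", start: 7248, end: 7293, comment: "U+1C5A–U+1C7D" },
];

// Returns the rows whose decimal bounds disagree with their comment.
function findMismatches(rows: RangeRow[]): Array<RangeRow & { actual: string }> {
  return rows
    .map((row) => ({ ...row, actual: `${toUPlus(row.start)}–${toUPlus(row.end)}` }))
    .filter((row) => row.actual !== row.comment);
}

for (const row of findMismatches(samples)) {
  console.log(`${row.name}: bounds are ${row.actual}, comment says ${row.comment}`);
}
```

The same check could be run over every entry in the list, since the remaining rows may contain similar mismatches.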
Well, I like the idea. I'll make the proposed changes, but I'm not sure what @Miodec's take on this is.
I will update the getGibberish logic to generate combined letters (supporting matras) for certain scripts, e.g., devanagari.
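To illustrate what combining letters with matras might involve, here is a minimal sketch for devanagari. It assumes a syllable is a consonant optionally followed by one vowel sign, using U+0915 to U+0939 for consonants and the vowel-sign range proposed earlier in this thread. The function names and the 50% chance are made up, and this is not the PR's actual getGibberish logic.

```typescript
interface CpRange {
  start: number; // inclusive
  end: number; // inclusive
}

const DEVANAGARI_CONSONANTS: CpRange = { start: 0x0915, end: 0x0939 }; // क to ह
const DEVANAGARI_VOWEL_SIGNS: CpRange = { start: 0x093e, end: 0x0948 }; // ा to ै

function pick(range: CpRange, random: () => number): string {
  const offset = Math.floor(random() * (range.end - range.start + 1));
  return String.fromCodePoint(range.start + offset);
}

// A vowel sign is only ever emitted directly after a consonant,
// so it always has a base letter to attach to.
function devanagariSyllable(
  random: () => number = Math.random,
  signChance = 0.5
): string {
  const consonant = pick(DEVANAGARI_CONSONANTS, random);
  return random() < signChance
    ? consonant + pick(DEVANAGARI_VOWEL_SIGNS, random)
    : consonant;
}

function devanagariGibberish(
  syllables: number,
  random: () => number = Math.random
): string {
  let word = "";
  for (let i = 0; i < syllables; i++) {
    word += devanagariSyllable(random);
  }
  return word;
}
```

Drawing vowel signs from the same pool as base letters would allow a sign at the start of a word or two signs in a row, which is why the sketch keeps the two ranges separate.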
Continuous integration check(s) failed. Please review the failing check's logs and make the necessary changes.