Scribe-Data
Add Japanese keyboard
Terms
- [X] I have searched open new keyboard issues
- [X] I agree to follow Scribe-Data's Code of Conduct
- [X] This issue is about one language, and I've changed the title to reflect this
Language support
I do not have a lot of knowledge about the Japanese language, but I thought it would be good to implement.
Contribution
Might need help from someone who knows Japanese.
Do you want to start off making versions of the nouns and verbs SPARQL queries, @henrikth93? We might need to check the statements on the Wikidata items for Japanese nouns and verbs, but I'd be happy to help with this!
Linked to this issue are #105 and #106. For this issue we'll be working on the formatting process.
Sharing some pointers..
We'll have to think through how to store different written forms together for the same word. Using our classic book example, the following are both ways to write the same word:
- 本
  - This is the kanji version, which is logographic
  - This character represents 'book' (worth noting though that some words can be composed of more than one kanji to represent it)
- ほん
  - This is the hiragana version, which is phonetic
  - These are two characters, ほ (ho) and ん (n), which make up ほん (hon)
Apart from the two scripts above, the third main one is katakana, which is also phonetic. Katakana is primarily used for distinct cases/meanings, e.g. writing foreign words that have been incorporated into Japanese. Some words, though, can have variants in all three scripts, with the katakana version having a more specific meaning than the hiragana version. Worth noting as well that katakana can also be used at times for what would be akin to bold and italic in English.
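As a rough illustration of the three scripts, a string's script can be guessed from Unicode code point ranges. This is only a minimal sketch and not Scribe-Data code: the function name `classify_script` and its labels are hypothetical, and the kanji check only covers the basic CJK Unified Ideographs block.

```python
def classify_script(text: str) -> str:
    """Guess the script of a string: hiragana, katakana, kanji, mixed or other."""
    scripts = set()
    for char in text:
        code = ord(char)
        if 0x3040 <= code <= 0x309F:  # Hiragana block
            scripts.add("hiragana")
        elif 0x30A0 <= code <= 0x30FF:  # Katakana block
            scripts.add("katakana")
        elif 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs (basic block only)
            scripts.add("kanji")
        else:
            scripts.add("other")

    if len(scripts) == 1:
        return scripts.pop()
    # More than one script (e.g. kanji plus hiragana) is "mixed"; empty input is "other".
    return "mixed" if scripts else "other"
```

With this, 本 would come back as kanji, ほん as hiragana, and a word written with both kanji and hiragana as mixed.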
> .. can also be used at times to what would be akin to bold and italic in English.
While this is true, we most likely do not have to store this, but just something to be aware of.
So we should plan on basically having ja and ja-hira versions of all of the queries? Each Japanese lexeme has versions of each of these, and then we'd have different interfaces for each?
> So we should plan on basically having ja and ja-hira versions of all of the queries?
Hmm.. I just checked, and perhaps not quite, I think.
Some words do not have a kanji form, so I wouldn't expect them to have both ja and ja-hira.
The verb いる (iru) for instance, which very roughly translates to 'to be' or 'to exist', only has a hiragana form - made of the two characters い (i) and る (ru).
However - the lexeme actually marks いる with ja and not ja-hira as might be expected. My guess would be then that ja is marking what would be considered the "full" or the "proper" written form:
- For 'to be/to exist', it is simply いる, since it has no kanji or katakana form
- For 'book', it is 本
  - It is worth noting that a version with kanji, if a word has one, is often the "full" form (not sure what to call it :laughing:)
- For the verb 'to eat', it is 食べる (taberu), which actually is a combination of kanji AND hiragana. 食 is a kanji associated with eating and food; here it takes on the pronunciation (ta). べ and る are hiragana, which respectively are for the sounds (be) and (ru)
  - Crucially, notice that there is also a ja-hira for 'to eat', which is the version written fully in hiragana, たべる, which is た (ta) and the same べ (be) and る (ru) used in 食べる
  - It is worth noting though that simply because 食 in the verb 'to eat' has the sound (ta), it does not mean that it always has that sound. In the word 定食 (teishoku) for instance, which is a style of restaurant menu item, 食 does not have the sound of (ta) but (shoku) instead
- For 'person', it is the kanji 人 (hito), which actually has three forms with:
  - the ja-hira form ひと, which is ひ (hi) and と (to)
  - the ja-kana form ヒト, which is ヒ (hi) and ト (to)
- For 'America', it is the katakana アメリカ (amerika), with ア (a) メ (me) リ (ri) カ (ka)
  - Interestingly, it also has a ja-x-Q754018 form, which if I were to guess, is likely the spelling using kanji that puts together characters that may have the syllables/sounds to also spell it out the same phonetically. So in 亜米利加, the characters also sound out (amerika). This is more for proper nouns/names. The kanji that are used don't necessarily need to have a symbolic, associated meaning like in the other examples above. However, using kanji that both may have the correct sounds AND a symbolic meaning is often a poetic/creative deliberate decision. This is often done when naming children. Surnames also get this, for instance, mine is spelled with 吉田 which has the sounds 吉 (yoshi) 田 (da), but also has the meaning 吉 (lucky) 田 (ricefield) - perhaps alluding to some ancestors being farmers :shrug:
In conclusion, I believe a lexeme should always have a ja form, but it may or may not also have ja-hira, ja-kana, and/or ja-x-Q754018 forms. Crucially, ja can be in any script, whatever the "proper" form is for the word. ja-x-Q754018 may show up (for words like names of places), but I would actually advocate for ignoring them.
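To make the tagging concrete, here is a minimal sketch of grouping the written forms of each lexeme by language tag. It assumes rows shaped like the standard SPARQL JSON results format, where each literal carries `value` and `xml:lang`. The variable names `lexeme` and `lemma`, the placeholder lexeme IDs, and the function `group_by_language_tag` are hypothetical, and the sample rows are hand-written from the examples above rather than fetched from Wikidata. It also assumes at most one row per tag per lexeme.

```python
from collections import defaultdict

# Tags we skip entirely, per the suggestion to ignore ja-x-Q754018.
IGNORED_TAGS = {"ja-x-Q754018"}


def group_by_language_tag(bindings):
    """Map each lexeme ID to a {language tag: written form} dictionary."""
    grouped = defaultdict(dict)
    for row in bindings:
        lemma = row["lemma"]
        tag = lemma.get("xml:lang")
        if tag is None or tag in IGNORED_TAGS:
            continue
        grouped[row["lexeme"]["value"]][tag] = lemma["value"]
    return dict(grouped)


# Hand-written sample rows; the lexeme IDs are placeholders, not real Wikidata IDs.
sample_bindings = [
    {"lexeme": {"value": "L-person"}, "lemma": {"value": "人", "xml:lang": "ja"}},
    {"lexeme": {"value": "L-person"}, "lemma": {"value": "ひと", "xml:lang": "ja-hira"}},
    {"lexeme": {"value": "L-person"}, "lemma": {"value": "ヒト", "xml:lang": "ja-kana"}},
    {"lexeme": {"value": "L-america"}, "lemma": {"value": "アメリカ", "xml:lang": "ja"}},
    {"lexeme": {"value": "L-america"}, "lemma": {"value": "亜米利加", "xml:lang": "ja-x-Q754018"}},
]

forms_by_lexeme = group_by_language_tag(sample_bindings)
```

Here the 'person' entry ends up with ja, ja-hira and ja-kana forms, while the 'America' entry keeps only its ja form.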
Thanks for the full explanation, @wkyoshida! Just checking as there are a lot of situations above and I'm trying a last ditch effort for a simple-ish system: would we be able to query such that for the ja words we just get them based on their language identifier, and for ja-hira we take it if it's there, or if not get the ja?
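The fallback being asked about could be sketched roughly as follows. The `forms` dictionary (language tag to written form) and the function name `hiragana_or_default` are hypothetical, and the sketch relies on the assumption from this thread that a ja form is always present.

```python
def hiragana_or_default(forms: dict) -> str:
    """Return the ja-hira form if the lexeme has one, otherwise the ja form."""
    if "ja-hira" in forms:
        return forms["ja-hira"]
    # Assumes every lexeme has a ja form, as discussed in this thread.
    return forms["ja"]
```

For 'to eat' this would pick たべる, and for いる, which only has a ja form, it would fall back to いる.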
I'm thinking what likely makes sense is:
- ja: Always grab it, regardless of which script it is using. It is the "full"/"proper" form.
- ja-x-Q754018: If this shows up, we can ignore it.
- ja-hira: If this shows up, still always grab it in addition to the ja. This will be needed to tell which pronunciation the kanji in the ja form are taking on.
- ja-kana: If this shows up, still always grab it in addition to the ja and ja-hira. If it is present, it is likely indicative of a more specific meaning. For our 'person' example 人, the ja-kana form is actually more understood to mean 'human' as in the species, i.e. Homo sapiens (you'll see this listed in Wikidata under senses). It's really almost a different word at that point.
  - For ja-kana though, we may not need to store the character string necessarily. There is pretty much a direct conversion hiragana-katakana, so simply using a boolean perhaps could suffice to understand that the katakana version has a particular meaning (beyond simply meaning, for instance, that it is bold or italics)
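The rules above can be sketched as follows, under a few assumptions: `forms` is a hypothetical dictionary from language tag to written form, and `JapaneseEntry`, `build_entry` and `hiragana_to_katakana` are illustrative names rather than anything in Scribe-Data. The conversion relies on the hiragana letters U+3041 to U+3096 sitting exactly 0x60 code points below their katakana counterparts, which is why a boolean may be enough for ja-kana.

```python
from dataclasses import dataclass
from typing import Optional

KANA_OFFSET = 0x60  # distance from a hiragana letter to its katakana counterpart


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana letters into the katakana block; leave other characters as-is."""
    return "".join(
        chr(ord(char) + KANA_OFFSET) if 0x3041 <= ord(char) <= 0x3096 else char
        for char in text
    )


@dataclass
class JapaneseEntry:
    written: str  # the ja form, in whatever script it uses
    hiragana: Optional[str]  # the ja-hira form, if the lexeme has one
    has_katakana_form: bool  # True when a ja-kana form exists


def build_entry(forms: dict) -> JapaneseEntry:
    """Apply the rules: always take ja, take ja-hira if present, flag ja-kana."""
    # ja-x-Q754018 is never read here, so it is ignored.
    return JapaneseEntry(
        written=forms["ja"],
        hiragana=forms.get("ja-hira"),
        has_katakana_form="ja-kana" in forms,
    )


person = build_entry({"ja": "人", "ja-hira": "ひと", "ja-kana": "ヒト"})
```

If the katakana string is needed later, it could be rebuilt from `person.hiragana` with `hiragana_to_katakana` instead of being stored.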
Hey @henrikth93 👋 I'm going to close this as we're going to be generating the queries for Japanese soon with the rest of the data, and when that's done we can work on adding in Japanese keyboards in the end applications. We can focus on data issues within Scribe-Data and keyboard issues in Scribe-iOS and Scribe-Android 😊
Thank you! :)