wiktextract

[en] Canonical forms for Japanese words.

Open daxida opened this issue 2 months ago • 6 comments

2025-11-14T09:00:47.786430Z  WARN kty: Canonical form: '楽しい -i' != word: '楽しい' @ https://en.wiktionary.org/wiki/楽しい#Japanese
2025-11-14T09:00:47.787128Z  WARN kty: Canonical form: '好き -na' != word: '好き' @ https://en.wiktionary.org/wiki/好き#Japanese
2025-11-14T09:00:47.788261Z  WARN kty: Canonical form: '走る intransitive godan' != word: '走る' @ https://en.wiktionary.org/wiki/走る#Japanese
2025-11-14T09:00:47.788436Z  WARN kty: Canonical form: '五色 ^' != word: '五色' @ https://en.wiktionary.org/wiki/五色#Japanese

I know that in general the canonical form can differ from the word, but here it looks like the canonical form is polluted with trailing grammatical markers ('-i', '-na', 'intransitive godan', '^').
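To make the mismatch concrete, here is a minimal sketch of a consumer-side check that separates the headword from whatever trails it. It assumes the stray text always follows the word after a space, as in the four warnings above; `split_canonical` is a hypothetical helper for illustration, not part of wiktextract or kty.

```python
from typing import Tuple


def split_canonical(word: str, canonical: str) -> Tuple[str, str]:
    """Split a canonical form into (headword, trailing residue).

    If the canonical form is the word followed by a space and extra text,
    return the word and that extra text. Otherwise return the canonical
    form unchanged with an empty residue.
    """
    prefix = word + " "
    if canonical.startswith(prefix):
        return word, canonical[len(prefix):].strip()
    return canonical, ""


# (word, canonical form) pairs taken from the warnings above, plus one clean entry.
samples = [
    ("楽しい", "楽しい -i"),
    ("好き", "好き -na"),
    ("走る", "走る intransitive godan"),
    ("五色", "五色 ^"),
    ("狸", "狸"),
]
for word, canonical in samples:
    print(split_canonical(word, canonical))
```

This only works around the symptom on the consumer side. A canonical form that legitimately differs from the word passes through untouched, which is why the proper fix belongs in the extractor.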

Also, what would be the correct approach to get the reading of a word? I tried sounds::other, but sometimes the template is not parsed (cf. this), then I tried replacing the ruby in the canonical form, but ran into the above issue.
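For reference, the ruby replacement I mean is roughly the following sketch. It assumes the ruby data is available as an ordered list of (base, reading) pairs next to the form; the pairs in the example are written by hand for illustration, and `apply_ruby` is a hypothetical helper name.

```python
from typing import List, Tuple


def apply_ruby(form: str, ruby: List[Tuple[str, str]]) -> str:
    """Replace each ruby base in `form` with its reading, scanning left to right."""
    out = []
    pos = 0
    for base, reading in ruby:
        idx = form.find(base, pos)
        if idx == -1:
            continue  # base text not found after the current position; skip this pair
        out.append(form[pos:idx])  # text between the previous base and this one
        out.append(reading)
        pos = idx + len(base)
    out.append(form[pos:])  # whatever follows the last replaced base
    return "".join(out)


# Hand-written ruby pairs, for illustration only.
print(apply_ruby("お腹が空いた", [("腹", "なか"), ("空", "す")]))
# With a polluted canonical form, the trailing marker survives the replacement.
print(apply_ruby("楽しい -i", [("楽", "たの")]))
```

The second call shows why the two problems are linked: the kana comes out right, but the stray '-i' is carried along into the result.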

daxida avatar Nov 14 '25 09:11 daxida

> Also, what would be the correct approach to get the reading of a word? I tried sounds::other, but sometimes the template is not parsed (cf. this), then I tried replacing the ruby in the canonical form, but ran into the above issue.

There don't seem to be any other readings for that entry. Could you unpack this a little? "お腹が空いた" is always "onaka ga suita" afaict, it's not a kanji entry where different readings would make sense.

kristian-clausal avatar Nov 14 '25 09:11 kristian-clausal

All the above words have their reading in sounds::other. You can see it in the kaikki links:

2025-11-14T09:49:50.953155Z  WARN kty: Canonical form: '楽しい -i' != word: '楽しい'
https://en.wiktionary.org/wiki/楽しい#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/楽/楽し/楽しい.html


2025-11-14T09:49:50.953861Z  WARN kty: Canonical form: '好き -na' != word: '好き'
https://en.wiktionary.org/wiki/好き#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/好/好き/好き.html


2025-11-14T09:49:50.954064Z  WARN kty: Equal for word: '狸'
https://en.wiktionary.org/wiki/狸#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/狸/狸/狸.html


2025-11-14T09:49:50.955014Z  WARN kty: Canonical form: '走る intransitive godan' != word: '走る'
https://en.wiktionary.org/wiki/走る#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/走/走る/走る.html


2025-11-14T09:49:50.955192Z  WARN kty: Canonical form: '五色 ^' != word: '五色'
https://en.wiktionary.org/wiki/五色#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/五/五色/五色.html


2025-11-14T09:49:50.955263Z  WARN kty: Equal for word: 'お腹が空いた'
https://en.wiktionary.org/wiki/お腹が空いた#Japanese
https://kaikki.org/dictionary/All%20languages%20combined/meaning/お/お腹/お腹が空いた.html

I can only assume that they come from the pronunciation section (the parsing of {{ja-pron}}), but for some reason it is not there for the last one (which, by the way, happens to have a proper canonical form), even though the template can be found on Wiktionary, in the pronunciation section.

Sorry about the WARNS, they are there because of reasons :D

daxida avatar Nov 14 '25 09:11 daxida

Ah, I see what you mean. I was referring to the Japanese reading (that's why I mentioned replacing the ruby).

For お腹が空いた it should be おなかがすいた (or with some spaces in between, that is fine too).
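What I am after could be sketched like this. It assumes the entry is a dict with a "sounds" list whose items may carry an "other" string, and that the kana spelling shows up there as plain kana, possibly with spaces. The sample entry is hypothetical, and `is_kana` and `kana_reading` are illustrative helpers.

```python
from typing import Optional


def is_kana(text: str) -> bool:
    """True if text is non-empty and, ignoring spaces, lies entirely within
    the Unicode Hiragana and Katakana blocks (U+3040 to U+30FF)."""
    stripped = text.replace(" ", "").replace("\u3000", "")
    return bool(stripped) and all("\u3040" <= ch <= "\u30ff" for ch in stripped)


def kana_reading(entry: dict) -> Optional[str]:
    """Return the first kana-only `other` value, with spaces removed, or None."""
    for sound in entry.get("sounds", []):
        other = sound.get("other")
        if isinstance(other, str) and is_kana(other):
            return other.replace(" ", "").replace("\u3000", "")
    return None


# Hypothetical entry shape, for illustration only.
entry = {"word": "楽しい", "sounds": [{"tags": ["Tokyo"]}, {"other": "たのしい"}]}
print(kana_reading(entry))
print(is_kana("おなか が すいた"))
```

When no kana-only value is present, as with お腹が空いた right now, `kana_reading` returns None and the caller has to fall back to something else.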

daxida avatar Nov 14 '25 09:11 daxida

That's the pronunciation, not the "reading", which is a lexical thing that tells the many ways to interpret (not necessarily pronounce) kanji into words. In tanoshii, the "other" comes from the line that shows the pitch-accent contour, which is the first thing in the pronunciation section. It's just that the pitch-accent marker above it has been stripped away, and the following bracketed romanization showing the pitch accent is not preserved...

No wonder: the pitch accent marks on top (see [image]) are fucking span style borders.

kristian-clausal avatar Nov 14 '25 09:11 kristian-clausal

Looking at https://en.wiktionary.org/wiki/%E7%8B%B8#Japanese, the Kanji section has Readings and is the only appropriate place for that data, but it doesn't have that data. I'm not sure where to put it; it might just be that we need a new list of Linkage items, like related, synonyms or derived, possibly called kanji-readings.

EDIT: character-readings rather.

kristian-clausal avatar Nov 14 '25 10:11 kristian-clausal

Fixed the suffix issue. I'll have to do some actual office work starting next week, so the rest is on the back burner.

kristian-clausal avatar Nov 14 '25 11:11 kristian-clausal