
Proposal of additional characters used in Written Cantonese

Open JPRidgeway opened this issue 4 years ago • 3 comments

There are frequently 'issues' requesting the addition of some glyphs (cf. 240, 228, 231). They are often based on gut feelings or arbitrary selections. Here I'm hoping to be at least the tiniest bit more scientific in such a discussion.

It is no secret that the IICore2020 list of Hong Kong characters does not cover everything actually in use in Written Cantonese. Luckily, we now have an up-to-date way of checking exactly which characters are used: Robert S. Bauer's The ABC Cantonese-English Comprehensive Dictionary, specifically aimed at describing the writing traditions of current Cantonese speakers in Hong Kong. I have cross-checked the complete text of the dictionary against both the Super OTC (to determine which glyphs are missing completely) and the HK Subset OTF (to find the cases where some other locality's glyphs silently serve as fallbacks). The comprehensive results (and a manually-concocted GlyphWiki-derived Hanazono Mincho-based font containing all of Tables 1&2, attached inside the .pdf) are in the file below. The main results are condensed into the four tables that follow.
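For anyone wanting to repeat the cross-check, the comparison described above boils down to two set differences. The following is only a minimal sketch, not the script actually used: `cross_check`, `full_coverage` and `hk_coverage` are hypothetical names, and the two coverage sets are assumed to have been extracted from the fonts' character maps beforehand.

```python
# Minimal sketch of the cross-check; all names here are illustrative assumptions.

def is_ideograph(cp: int) -> bool:
    """Rough test for CJK Unified Ideographs (URO, Extension A, Extensions B-F)."""
    return (
        0x4E00 <= cp <= 0x9FFF         # URO
        or 0x3400 <= cp <= 0x4DBF      # Extension A
        or 0x20000 <= cp <= 0x2EBEF    # Extensions B-F (includes small unassigned gaps)
    )


def cross_check(text, full_coverage, hk_coverage):
    """Split the ideographs used in `text` into two problem groups.

    full_coverage: code points covered by the complete font (assumed input)
    hk_coverage:   code points covered by the HK subset font (assumed input)
    Returns (missing_everywhere, present_only_outside_hk_subset), both sorted.
    """
    used = {ord(ch) for ch in text if is_ideograph(ord(ch))}
    missing = used - full_coverage                      # no glyph at all
    fallback_only = (used & full_coverage) - hk_coverage  # glyph exists, but not in the HK subset
    return sorted(missing), sorted(fallback_only)
```

The first list corresponds roughly to Table 1. The second list mixes Tables 3 and 4, because telling those apart requires looking at whether an HK-specific glyph exists, which a code point comparison alone cannot show.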

Table 1. Glyphs from Extensions B‒F absent from Source Han Sans despite usage in Written Cantonese

I request to add the completely new glyphs for the following Unicode points, drawn in HK-style, or at least those in bold:

Point Glyph Reading
U+20C32 𠰲 oek2/4/6, oet2/4/6
U+216A6 𡚦 hai1
U+22D15 𢴕 ling1, ning1
U+259F9 𥧹 dam6, tam4/5
U+265E7 𦗧 dap1
U+272F5 𧋵 kwong4
U+27685 𧚅 long6, nong6
U+2B9F6 𫧶 keng4
U+2BA5D 𫩝 ci1
U+2BAC3 𫫃 e1, nge1
U+2BAFC 𫫼 ngap1
U+2BB3F 𫬿 ngang2, ngang4
U+2BB4A 𫭊 wang6
U+2BCAD 𫲭 keng1
U+2BD71 𫵱 cat6
U+2C710 𬜐 lem2, lim2
U+2C9A0 𬦠 naai2
U+2C9CA 𬧊 lak1
U+2C9EF 𬧯 beu6
U+2CE33 𬸳 gaan2
U+2D25D 𭉝 syu2, syu4
U+2D85A 𭉝 faak3
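Because the glyph column is easy to garble when the table is copied around, the code point column is the safer key. As a small sketch (the helper names are made up for illustration), the following regenerates each character from its code point and reports which extension block it belongs to; the block ranges are those of the Unicode Standard.

```python
# Sketch: rebuild Table 1 from the code points alone. Helper names are illustrative.

EXTENSION_BLOCKS = [
    ("B", 0x20000, 0x2A6DF),
    ("C", 0x2A700, 0x2B73F),
    ("D", 0x2B740, 0x2B81F),
    ("E", 0x2B820, 0x2CEAF),
    ("F", 0x2CEB0, 0x2EBEF),
]

TABLE_1 = [
    "U+20C32", "U+216A6", "U+22D15", "U+259F9", "U+265E7", "U+272F5",
    "U+27685", "U+2B9F6", "U+2BA5D", "U+2BAC3", "U+2BAFC", "U+2BB3F",
    "U+2BB4A", "U+2BCAD", "U+2BD71", "U+2C710", "U+2C9A0", "U+2C9CA",
    "U+2C9EF", "U+2CE33", "U+2D25D", "U+2D85A",
]


def parse_code_point(label: str) -> int:
    """Turn a 'U+XXXXX' label into an integer code point."""
    if not label.startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    return int(label[2:], 16)


def extension_of(cp: int):
    """Return the extension letter for a code point, or None if outside B-F."""
    for name, low, high in EXTENSION_BLOCKS:
        if low <= cp <= high:
            return name
    return None


def describe(label: str):
    """Return (label, character, extension letter) for one table row."""
    cp = parse_code_point(label)
    return label, chr(cp), extension_of(cp)


if __name__ == "__main__":
    for row in TABLE_1:
        print(*describe(row))
```

Comparing the printed characters with the glyph column is a quick way to catch a row whose glyph and code point have drifted apart.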

Table 2. Glyphs outside Unicode used in Written Cantonese that can be incorporated using IDS composition

For one reason or another, there are still 11 glyphs outside Unicode in active use in Cantonese. Ideally, it would be great to include them via their IDS compositions, but this is not an immediate necessity.

IDS Reading
⿸疒背 bui6
⿰口揼 dap6
⿰口絞 gaau6
⿰口剋 gwak6
⿰口乜 met1
⿰口揖 jaap6, ngap1
⿱弗手 dap6
⿰口壁 bek6
⿰扌罨 ngap1
⿰口梃 ting2
⿰扌熨 tong3
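To make the IDS notation in Table 2 concrete: each sequence is a prefix expression in which an ideographic description character (IDC) is followed by its components. Below is a minimal parsing sketch, assuming only the twelve original IDCs (U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands); `parse_ids` is a hypothetical helper, not part of any font tooling, and IDCs added in later Unicode versions are not handled.

```python
# Sketch of an IDS parser; parse_ids is an illustrative helper, not an existing API.

TERNARY_IDCS = {"\u2ff2", "\u2ff3"}  # left-middle-right and above-middle-below


def is_idc(ch: str) -> bool:
    """True for the twelve original Ideographic Description Characters."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def parse_ids(ids: str):
    """Parse an IDS into a nested tuple: (operator, component, component, ...)."""

    def parse(pos: int):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        if not is_idc(ch):
            return ch, pos + 1            # a plain component
        arity = 3 if ch in TERNARY_IDCS else 2
        pos += 1
        parts = []
        for _ in range(arity):
            node, pos = parse(pos)        # operands may themselves be sequences
            parts.append(node)
        return (ch, *parts), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree
```

For example, the `met1` entry parses into a left-right operator with 口 and 乜 as its two components. A check like this only confirms that a sequence is well formed; it says nothing about whether a font can render it.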

Table 3. Glyphs from URO and Extension A used in Written Cantonese and having HK forms in Source Han Sans

For the following, no new design work is required; they only need to be added to the Subset OTF for the HK region.

内厾咊哴岀捞捦査横歳爲盘蝲譲趸鬦鬬䁽

Table 4. Glyphs from URO and Extension A used in Written Cantonese not having HK forms in Source Han Sans

These require an additional HK form to be drawn.

哾噖庝掕攋煈脶舗蝄㞓䭤

To conclude: the simplest and most immediate course of action is adding the characters in Tables 3 and 4 to the Subset OTF for Cantonese.

Cantonese Breakdown Chart.pdf

JPRidgeway avatar Aug 28 '19 18:08 JPRidgeway

@JPRidgeway I wrote a somewhat substantial reply earlier today, which somehow never posted. I may have previewed it, but never clicked the green "Comment" button. So, I'll try again.

What I wanted to first point out is that most of the ideographs in Table 2 are either in UAX45 or IRG Working Set 2017:

IDS Reading U-Source UK-Source IRG Working Set 2017
⿸疒背 bui6 UTC-00703 - -
⿰口揼 dap6 UTC-00416* - -
⿰口絞 gaau6 UTC-00563 - -
⿰口剋 gwak6 - - -
⿰口乜 met1 - UK-10503 00373
⿰口揖 jaap6, ngap1 UTC-00380 - -
⿱弗手 dap6 UTC-00637 - -
⿰口壁 bek6 - - -
⿰扌罨 ngap1 UTC-00653 - -
⿰口梃 ting2 - - -
⿰扌熨 tong3 UTC-00423 - -

* = UTC-00904 is a known duplicate of UTC-00416.

For the seven ideographs in UAX45 with U-Source source references, please provide to the UTC evidence so that we can include them in the UTC's submission for the next IRG working set. For the three that do not have a source reference, please submit them to the UTC, along with evidence and metadata, so that they can be added to UAX45 and be assigned a U-Source source reference. This is the first step toward getting them encoded. See L2/19-043 as an example of a proposal to add ideographs to UAX45.

With regard to implementing the ideographs in Table 2 via their IDSes, that will not happen. The use of IDSes and the 'ccmp' GSUB feature for biáng and friends was considered an extraordinary case that we don't plan to repeat. Also, the maintenance update that is planned for later this year will encode those Extension G glyphs from their finally-stable Plane 3 code points, and as a result, I will also nuke their IDS-based 'ccmp' substitutions from orbit.

In any case, everything in all four of your tables would be targeted for Version 3.000, which is neither planned nor scheduled at this point. The Good News™ is that this gives you sufficient time to take care of 10 of the 11 ideographs in Table 2, in terms of preparing them for encoding.

kenlunde avatar Aug 28 '19 20:08 kenlunde

A lesson to myself: stop relying on a chart of IRG Working Set 2015 as my only source of not-yet-Unicode content and use the current version of the U-Source charts.

Five of these 11 are marked in the aforementioned charts as “W: Not suitable for encoding as a CJK Unified Ideograph”. I wonder what the reasoning for that was, and whether new evidence of Cantonese usage could serve as a counterargument. (⿸疒背, say, is also a Japanese 国字, used in 世尊時本字鏡, but still marked W).

Of course, being able to encode them in Unicode and then add them to the font the proper way is a much better solution than anything ad hoc (I believe inputting through IDSes is not even what Cantonese users would expect; the expected route is some well-aligned PUA, which is definitely not Source Han Sans). I might attempt to assemble usage data for the glyphs under discussion for a new proposal.

Good luck in preparing future versions!

JPRidgeway avatar Aug 29 '19 07:08 JPRidgeway

@JPRidgeway My point about the ideographs in Table 2, particularly the seven that are in UAX45, is that you seem to be in a unique position to provide to the UTC evidence that would allow their status to be changed from W or X to N.

In terms of places to look, you need to track all extensions, with Extension G being the latest that is in the pipeline for Unicode Version 13.0. You also need to look at the characters that have been appended to the URO, and now, starting from 13.0, those appended to Extensions A and B. And yes, UAX45, which grows a bit with each version of Unicode, along with the IRG working sets, with IRG Working Set 2017 being the latest.

kenlunde avatar Aug 29 '19 17:08 kenlunde