
Add mappings for Ext. E/F/G

Open JLHwung opened this issue 3 years ago • 2 comments

Fixes #5

This PR adds mappings for the characters in the following blocks (as of Unicode 13):

  [0x2b820, 0x2cea1], // CJK Ideographs Extension E
  [0x2ceb0, 0x2ebe0], // CJK Ideographs Extension F
  [0x30000, 0x3134a] // CJK Ideographs Extension G
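As a rough illustration of how these inclusive ranges can be used, the sketch below tests whether a character falls in one of the three blocks. The names `NEW_BLOCKS` and `isInNewBlocks` are hypothetical and are not taken from the scripts in this PR.

```javascript
// Ranges copied from the PR description; both ends are inclusive.
const NEW_BLOCKS = [
  [0x2b820, 0x2cea1], // CJK Ideographs Extension E
  [0x2ceb0, 0x2ebe0], // CJK Ideographs Extension F
  [0x30000, 0x3134a], // CJK Ideographs Extension G
];

// Hypothetical helper: true when the first code point of `char`
// lies inside one of the ranges above.
function isInNewBlocks(char) {
  const cp = char.codePointAt(0);
  return NEW_BLOCKS.some(([start, end]) => cp >= start && cp <= end);
}

console.log(isInNewBlocks(String.fromCodePoint(0x30000))); // true (first code point of Ext. G)
console.log(isInNewBlocks(String.fromCodePoint(0x20000))); // false (Ext. B, outside this PR)
```

`codePointAt(0)` is used rather than `charCodeAt(0)` because all of these characters lie outside the Basic Multilingual Plane and occupy two UTF-16 code units.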

The mappings are copied from https://github.com/Jackchows/Cangjie5, with additional fixes from https://github.com/Jackchows/Cangjie5/pull/209. Some mappings are deliberately discarded because they do not fit within the current scope, specifically:

  • mapping starting with z for CJK Compatibility Ideographs and CJK Compatibility Ideographs Supplement
  • mapping starting with x (we don't have x mapping for CJK Ext. B)
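The two exclusions above amount to a simple prefix filter. The sketch below assumes each mapping is an object of the form `{ char, code }` with a lowercase code; that shape and the name `isInScope` are illustrative assumptions, and the sample entries are placeholders rather than real mappings.

```javascript
// Hypothetical filter: drop codes that start with "z" or "x".
// A "z" elsewhere in the code is not affected by this rule.
function isInScope(entry) {
  return !entry.code.startsWith("z") && !entry.code.startsWith("x");
}

const sample = [
  { char: "A", code: "zabc" }, // placeholder, discarded (starts with z)
  { char: "B", code: "xabc" }, // placeholder, discarded (starts with x)
  { char: "C", code: "izi" },  // kept: z is not the first letter
];

console.log(sample.filter(isInScope).map((e) => e.code)); // [ 'izi' ]
```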

The first commit fixes ordering issues in the current mappings. It is an editorial fix with no observable behaviour change. The second and third commits add new mappings ordered by Cangjie code. The new mappings are appended to the current mappings, so the character frequency order is not affected.
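The append-after-sort idea can be sketched as follows. This is not the code used in the commits; `appendSorted` and the `{ char, code }` entry shape are assumptions made for illustration, and the characters are placeholders.

```javascript
// Hypothetical helper: sort only the additions by Cangjie code, then
// append them, leaving the order of the existing entries untouched.
function appendSorted(existing, additions) {
  const sorted = [...additions].sort((a, b) =>
    a.code < b.code ? -1 : a.code > b.code ? 1 : 0
  );
  return [...existing, ...sorted];
}

const existing = [
  { char: "P", code: "ilvv" }, // placeholder characters; existing order is preserved
  { char: "Q", code: "ilv" },
];
const additions = [
  { char: "R", code: "nnz" },
  { char: "S", code: "buhuz" },
];

console.log(appendSorted(existing, additions).map((e) => e.code));
// [ 'ilvv', 'ilv', 'buhuz', 'nnz' ]
```

Copying `additions` before sorting avoids mutating the caller's array, since `Array.prototype.sort` sorts in place.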

While authoring this PR, I wrote two scripts. Feel free to reuse them, as Ext. H is hopefully targeted for 2022. (link of scripts)

  • merge-cangjie5-txt.mjs merges the mappings defined in Cangjie5.txt of https://github.com/Jackchows/Cangjie5.
  • check-coverage.mjs checks whether we have covered all the CJK Unified Ideographs; it can easily be extended to other Unicode blocks. Thanks to this script I have opened #11.
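For readers without access to the scripts, a coverage check of this kind can be sketched as below. This is not the actual `check-coverage.mjs`; the function name `findUncovered` and its inputs (a list of mapped characters and a list of inclusive ranges) are assumptions.

```javascript
// Hypothetical coverage check: return every code point in `ranges`
// that has no entry in `mappedChars`.
function findUncovered(mappedChars, ranges) {
  const covered = new Set(mappedChars.map((c) => c.codePointAt(0)));
  const missing = [];
  for (const [start, end] of ranges) {
    for (let cp = start; cp <= end; cp++) {
      if (!covered.has(cp)) missing.push(cp);
    }
  }
  return missing;
}

// Tiny demonstration range; a real run would pass full block ranges.
const mapped = [0x30000, 0x30002].map((cp) => String.fromCodePoint(cp));
const missing = findUncovered(mapped, [[0x30000, 0x30002]]);
console.log(missing.map((cp) => "U+" + cp.toString(16).toUpperCase())); // [ 'U+30001' ]
```

Extending the check to another Unicode block only requires adding that block's range to the second argument.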

Current known issues:

use of rotational operator z in specific characters:

𮗙	buhuz
𰒥	izi
𫸪	nnz
𰨇	ozmmf
𰲞	yniz
𬢆	yzbuu

The author of https://github.com/Jackchows/Cangjie5 deliberately used Z (defined in Cangjie 6 as a rotation operator, see Section 14 for the rationale) to encode these 6 characters. However, this is not consistent with what we already have for such characters in Ext. B:

𠄏	ilv
𠄔	ilvv
𣀨	iiye
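To find entries affected by this inconsistency, one could scan the tab-separated mapping lines for codes that contain z. The sketch below assumes one `character<TAB>code` pair per line, as in the lists above; `findRotationCodes` is a hypothetical name.

```javascript
// Hypothetical scan: parse "char<TAB>code" lines and keep the entries
// whose code contains the rotation operator "z".
function findRotationCodes(text) {
  return text
    .split("\n")
    .filter((line) => line.includes("\t"))
    .map((line) => {
      const [char, code] = line.split("\t");
      return { char, code: code.trim() };
    })
    .filter((entry) => entry.code.includes("z"));
}

// Sample lines taken from the lists in this PR description.
const lines = ["𮗙\tbuhuz", "𠄏\tilv", "𰒥\tizi"].join("\n");
console.log(findRotationCodes(lines).map((e) => e.code)); // [ 'buhuz', 'izi' ]
```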

There are three ways to address the inconsistency:

  1. Reach consensus on using z for these specific characters and add new mappings for the existing Ext. B characters:
𠄏	nnz
𠄔	ninz
𣀨	izye

The old mappings for 𠄏𠄔𣀨 will be preserved as compatibility mappings. The new mappings for 𮗙𰒥𫸪𰨇𰲞𬢆 are kept as they are. Both @LEOYoon-Tsaw and I are OK with using z for 𮗙𰒥𫸪𰨇𰲞𬢆, but I am open to different opinions from the community.

  2. Stay with the Cangjie 5 code scheme and come up with our own mappings for 𮗙𰒥𫸪𰨇𰲞𬢆. I can revise this PR with the new mappings.

  3. Remove 𮗙𰒥𫸪𰨇𰲞𬢆 from the mappings and postpone them until we reach consensus on how to encode them.

My preference on these 3 solutions is 1 > 2 > 3.

JLHwung, Mar 19 '21 16:03

https://github.com/rime/rime-cangjie/blob/8dfad9e537f18821b71ba28773315d9c670ae245/cangjie5.dict.yaml?raw=1

line 16
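# 包含結構的單字,被包含部分的編碼位於'符號之後,可據此取得尾碼。
(Translation: for single characters with an enclosing structure, the code of the enclosed part comes after the ' symbol, from which the tail code can be obtained.)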

Is this for the dict feature?
How many chars introduced in this PR are actually involved in any dict that we provide?
~~it'd better be zero~~ ;P

Un1Gfn, Dec 24 '21 07:12

> How many chars introduced in this PR are actually involved in any dict that we provide?

I am not familiar with the dict feature. Can you point me to some references?

Disclaimer: I use the mapping a lot and mostly query only Ext. A - G characters. I would say the mapping is quite good in general. Supporting new Ext. blocks is hard, and I think it is fine to just merge the PR and move forward. We can always iterate when we find errors or if we can do more for the dict feature.

JLHwung, Dec 24 '21 15:12