If gb18030 is revised, consider aligning the Encoding Standard
GB18030-2005 already defines a one-to-one mapping between Unicode and GB18030, except for 14 characters that were still mapped to the Unicode PUA as of 2005. Nowadays all 14 of those characters have corresponding non-PUA Unicode code points, so I suggest the Encoding Standard map them to regular Unicode characters rather than PUA characters.
The following 80 characters are the GBK characters that were at some point mapped to the Unicode PUA, together with their corresponding non-PUA Unicode code points:
GBK    Unicode PUA    Unicode non-PUA
FE50 E815 2E81
FE51 E816 20087
FE52 E817 20089
FE53 E818 200CC
FE54 E819 2E84
FE55 E81A 3473
FE56 E81B 3447
FE57 E81C 2E88
FE58 E81D 2E8B
FE59 E81E 9FB4
FE5A E81F 359E
FE5B E820 361A
FE5C E821 360E
FE5D E822 2E8C
FE5E E823 2E97
FE5F E824 396E
FE60 E825 3918
FE61 E826 9FB5
FE62 E827 39CF
FE63 E828 39DF
FE64 E829 3A73
FE65 E82A 39D0
FE66 E82B 9FB6
FE67 E82C 9FB7
FE68 E82D 3B4E
FE69 E82E 3C6E
FE6A E82F 3CE0
FE6B E830 2EA7
FE6C E831 215D7
FE6D E832 9FB8
FE6E E833 2EAA
FE6F E834 4056
FE70 E835 415F
FE71 E836 2EAE
FE72 E837 4337
FE73 E838 2EB3
FE74 E839 2EB6
FE75 E83A 2EB7
FE76 E83B 2298F
FE77 E83C 43B1
FE78 E83D 43AC
FE79 E83E 2EBB
FE7A E83F 43DD
FE7B E840 44D6
FE7C E841 4661
FE7D E842 464C
FE7E E843 9FB9
FE80 E844 4723
FE81 E845 4729
FE82 E846 477C
FE83 E847 478D
FE84 E848 2ECA
FE85 E849 4947
FE86 E84A 497A
FE87 E84B 497D
FE88 E84C 4982
FE89 E84D 4983
FE8A E84E 4985
FE8B E84F 4986
FE8C E850 499F
FE8D E851 499B
FE8E E852 49B7
FE8F E853 49B6
FE90 E854 9FBA
FE91 E855 241FE
FE92 E856 4CA3
FE93 E857 4C9F
FE94 E858 4CA0
FE95 E859 4CA1
FE96 E85A 4C77
FE97 E85B 4CA2
FE98 E85C 4D13
FE99 E85D 4D14
FE9A E85E 4D15
FE9B E85F 4D16
FE9C E860 4D17
FE9D E861 4D18
FE9E E862 4D19
FE9F E863 4DAE
FEA0 E864 9FBB
The following 14 characters are the GB18030-2005 characters that are still mapped to the Unicode PUA. I suggest the Encoding Standard map these characters to non-PUA Unicode as well: there is no need to wait for GB18030 to update its spec just for these 14 characters, and we can already be certain which non-PUA Unicode code points they correspond to.
GBK    Unicode PUA    Unicode non-PUA
FE51 E816 20087
FE52 E817 20089
FE53 E818 200CC
FE59 E81E 9FB4
FE61 E826 9FB5
FE66 E82B 9FB6
FE67 E82C 9FB7
FE6C E831 215D7
FE6D E832 9FB8
FE76 E83B 2298F
FE7E E843 9FB9
FE90 E854 9FBA
FE91 E855 241FE
FEA0 E864 9FBB
With these mappings we can decode all strings in the GBK encoding family to non-PUA Unicode. Beyond that, we still need a way to convert all the historical Unicode PUA characters to the proper GBK (GB18030) characters.
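The decode-side suggestion can be sketched as a post-decode normalization pass. This is illustrative only (the helper name `normalize_gb18030_pua` is mine, not from any library); the table is the 14-character list above:

```python
# Normalize the 14 GB18030-2005 PUA code points to their modern
# non-PUA equivalents after decoding (mapping from the table above).
PUA_TO_STANDARD = {
    0xE816: 0x20087, 0xE817: 0x20089, 0xE818: 0x200CC,
    0xE81E: 0x9FB4,  0xE826: 0x9FB5,  0xE82B: 0x9FB6,
    0xE82C: 0x9FB7,  0xE831: 0x215D7, 0xE832: 0x9FB8,
    0xE83B: 0x2298F, 0xE843: 0x9FB9,  0xE854: 0x9FBA,
    0xE855: 0x241FE, 0xE864: 0x9FBB,
}
_TRANSLATION = {pua: chr(std) for pua, std in PUA_TO_STANDARD.items()}

def normalize_gb18030_pua(raw: bytes) -> str:
    """Decode gb18030 bytes, then replace the legacy PUA code points."""
    return raw.decode("gb18030").translate(_TRANSLATION)

# 0xFE51 decodes to U+E816 under the current mapping; after
# normalization we get U+20087 instead.
assert normalize_gb18030_pua(b"\xfe\x51") == "\U00020087"
```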
I disagree. We shouldn't invent yet another new encoding anymore.
I tend to agree with @vyv03354. Since no implementation does this and developers are asked to use utf-8, I don't really see an upside here. This only increases the chance that things break.
@vyv03354 @annevk We are not inventing a new encoding, just making an existing encoding work.
Fair, changing an encoding is not inventing a new one. However, it is not clear why we should change it, since implementations mostly agree here.
@annevk @vyv03354 Please consider the following situation: suppose I have text containing the Unicode character U+20087. When I convert this character to GBK, what should I do? Emit 0xFE51, or some invalid character? So we are just refining the existing conversion table to its final state.
when I convert this character to GBK,
We don't convert any plane-2 characters in GBK encoder. It will be changed to a character reference (𠂇).
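This fallback behavior can be illustrated with Python's `gbk` codec, which behaves analogously (a sketch; the exact fallback mechanism depends on the implementation):

```python
# U+20087 has no two-byte GBK encoding, so a strict GBK encoder fails;
# with a character-reference fallback it becomes &#131207;
# (0x20087 = 131207 in decimal).
ch = "\U00020087"

try:
    ch.encode("gbk")
except UnicodeEncodeError:
    print("not encodable in plain GBK")

print(ch.encode("gbk", errors="xmlcharrefreplace"))  # b'&#131207;'
```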
Japanese users suffered from encoding "improvements" in JIS standards and industrial de facto standards. Even a one-character change is considered a new encoding in ISO coded character set standards. Such a change will do more harm than good, even if it is made out of good will.
@vyv03354 That's really different: JIS never mapped any characters to Unicode PUA code points. The GBK mappings only used the PUA because, at the time, Unicode did not yet encode all of the GBK character set. Now it does, so the situation is totally different.
@lygstate as with the other issue, I recommend using utf-8 instead. I agree with @vyv03354 that changing implementations at this point is more likely to lead to breakage than happy users.
Will you add something like a line of "note" to the description for gb18030 in the spec mentioning this issue? The PUA really brings a lot of issues to users, as using its code points without a common agreement is like inventing a nationwide Unicode dialect.
To be frank I would rather leave the dialect pollution in the legacy encoder/decoder bridge than let it spread in the new world, so please consider adding:
- a flag that instructs the decoder to not emit PUA
- a flag that instructs the encoder to warn against PUA usage potentially resulting from GB18030-200{0,5} decoding
and as a basis for these changes,
- a mapping from "old world" PUAs to "new world" Unicode CJK Extensions.
See also:
- https://blogs.adobe.com/CCJKType/2015/03/to-gb18030.html
- http://www.unicode.org/L2/L2006/06394-gb18030-2005.txt
- https://en.wikipedia.org/wiki/Talk%3AGB_18030#The_need_for_a_new_mapping_table
- https://ssl.icu-project.org/docs/papers/unicode-gb18030-faq.html
@lygstate Could you please consider reopening this issue if you find my — um — attempt helpful?
@Artoria2e5 the "new world" should use utf-8 exclusively.
@annevk But we still need a way to migrate from the old world.
@annevk It's true that the modern world should use UTF-8 for information exchange, processing and storage. But given that character representation in UTF-8 relies on code points assigned in Unicode, it makes sense to use the formal, universal code point assignments in this universal encoding.
As stated previously, by emitting PUA codepoints in the decoder, you are speaking in a Unicode dialect codepoint-wise, resulting in a less interchangeable UTF-8 variant, thus contradicting the point of using UTF-8 everywhere. (The use of PUA here cannot be justified by a lack of definition as these ideographs do have formal assignments.) The encoder part is more about discouraging old PUA usage.
But we still need a way to migra[t]e from the old world.
And we need to make sure that the way gives us actual "new world" stuff.
By the way, there should be 24 PUA codepoints in the 2005 standard instead of 14, according to the L2/06-394 "Update on GB 18030:2005" by Ken Lunde.
An interesting but sad example of this dialect split can be shown using the character U+20087 (𠂇), assigned to PUA codepoint U+E816 () in the mapping. Search engines like Google won't do normalization on PUA forms where several different sets of agreements exist, and you can see it from the search results.
Given that no browser implements gb18030 like that I don't see why we should change this. We could easily break those relying on these bytes mapping to PUA. I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion and those maintaining gb18030 have not decided to care.
The GB18030 mapping is naturally fungible wrt PUA characters, since Unicode continues to encode Chinese code points. I think this should be recognized by Encoding.
I agree that we should not remove mapping of Unicode PUA -> GB18030 (compatibility). But the problem here is round-tripping of real Unicode code points with GB18030.
If I have a U+20087, convert it to GB18030, and then later reserialize the GB data as UTF-8, I will get back U+E816 rather than the original (and correct) code point. That's undesirable and a loss of information. The fact that existing implementations haven't caught up with standardization doesn't mean that we shouldn't make this change.
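The asymmetry can be observed with Python's `gb18030` codec, which follows the legacy mapping (a sketch of the status quo, not of any proposed behavior):

```python
# The two-byte sequence 0xFE51 decodes to the PUA code point U+E816,
# while the non-PUA U+20087 encodes to the four-byte sequence 0x95329031.
# Both byte sequences are supposed to represent the same ideograph.
assert b"\xfe\x51".decode("gb18030") == "\ue816"
assert "\U00020087".encode("gb18030") == b"\x95\x32\x90\x31"

# Round-tripping the legacy two-byte form through UTF-8 therefore
# yields the PUA code point, never U+20087:
pua_text = b"\xfe\x51".decode("gb18030")
print(pua_text.encode("utf-8"))  # b'\xee\xa0\x96' (UTF-8 for U+E816)
```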
@annevk Under what circumstances would we change? One of the problems with establishing a standard is that implementations are trying hard to be compliant with it...
@aphillips what standard are we talking about? The standard for gb18030 has that loss of information and Encoding doesn't modify it (it does modify some other parts).
Given that no browser implements gb18030 like that I don't see why we should change this.
Newer Pan-CJK font families like Adobe's Source Han Sans (led by @kenlunde) have decided to go with Unicode instead of GB 18030-flavored Unicode.
I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion
Dr. Ken "Someone" Lunde (again!) is among the editors of UAX 38 Unihan database, and has very extensive participation of many CJK-related standardization processes in Unicode.
and those maintaining gb18030 have not decided to care.
The Chinese SAC has decided not to care about a lot of things including their translations of ANSI C (GB/T 15272:1994, ISO/IEC 9899:1990) and UCS (GB 13000:2010, ISO/IEC 10646:2003). But this lag doesn't mean that the Chinese are not using newer revisions of the C language and Unicode. The same should apply to the UCS references in GB 18030:2005.
2016-09-12: Found out that W3C (well, that sounds impractical) has some rules regarding using PUA in i18n specs.
@Artoria2e5: The reasons why Source Han Sans (and the Google-branded Noto Sans CJK) does not support the 24 PUA code points of GB 18030 are because 1) PUA code points should be avoided in general; 2) PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts; 3) a GB 18030 revision is expected to be published soon that will specify the non-PUA code points for these 24 characters, which will effectively lift the PUA requirement; 4) the 24 characters have had non-PUA code points for over a decade; and 5) the "release" branch of Source Han Sans includes a utf32-gb18030pua24.map file that provides the 24 PUA mappings for those developers who need support for these PUA code points.
PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts;
Hmm, I guess that an encoding spec for dealing with legacy encodings also falls into the scope of "mixing multiple standards". It looks like reasons 1–4 are on my side...
I guess if gb18030 is actually revised there is a chance web-focused implementations might want to change their mapping. If that happens and implementations indeed want to make a backwards incompatible change someone should raise a new issue.
This issue was raised by me last year-early this year in #22 (and https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 ).
As I wrote there, the current mapping makes it impossible to display those characters involved [1] on some platforms (Android and Windows 10 [2]) when they're encoded in GB 18030 because there is NO font covering the corresponding PUA code points. This is one of the most serious consequences of the current mapping to me (besides other consequences mentioned earlier).
OTOH, if there are multiple fonts covering those PUA code points with different interpretations, there's no easy way to pick the right one (if the only information at hand is code points), because the identity of a PUA code point is up to private parties and is indeterminate by definition.
( Needless to say, there'd be no such problem if UTF-8 is used with regular code points and we want everybody to use UTF-8 on the web. )
Given all these, removing any mapping to PUA code points (as long as there are regular Unicode characters) is desired.
As mentioned in #22, I initially thought that GB18030:2005 had fixed all these up (by 2005, all the characters originally mapped to PUA code points had been encoded in Unicode) in a way similar to what was done for HKSCS. It turned out that that was not the case, which was rather disappointing. As a result (and because the 24 affected characters are rarely used, especially the U+FE1x ones), the change for #22 was minimal (only one code point was fixed, per GB18030:2005).
Given that GB18030 will be revised soon (per @kenlunde) to eliminate the canonical mapping to PUA code points, Chromium is more than willing to go ahead with mapping the 24 byte-sequences in GB18030 to regular Unicode characters.
[1] In addition to the 14 CJK ideographs/radicals listed earlier, there are vertical form variants that are still mapped to PUA code points (though U+FE1x will be virtually unused in gb18030-encoded documents).

0xA6D9  U+E78D  U+FE10
0xA6DA  U+E78E  U+FE12
0xA6DB  U+E78F  U+FE11
0xA6DC  U+E790  U+FE13
0xA6DD  U+E791  U+FE14
0xA6DE  U+E792  U+FE15
0xA6DF  U+E793  U+FE16
0xA6EC  U+E794  U+FE17
0xA6ED  U+E795  U+FE18
0xA6F3  U+E796  U+FE19
0xFE51  U+E816  U+20087
0xFE52  U+E817  U+20089
0xFE53  U+E818  U+200CC
0xFE59  U+E81E  U+9FB4
0xFE61  U+E826  U+9FB5
0xFE66  U+E82B  U+9FB6
0xFE67  U+E82C  U+9FB7
0xFE6C  U+E831  U+215D7
0xFE6D  U+E832  U+9FB8
0xFE76  U+E83B  U+2298F
0xFE7E  U+E843  U+9FB9
0xFE90  U+E854  U+9FBA
0xFE91  U+E855  U+241FE
0xFEA0  U+E864  U+9FBB
[2] Android (at least Google's Nexus devices) does not have any font covering the PUA code points listed in [1]. Out of the box (perhaps unless your UI language is Simplified Chinese), Windows 10 does not have Simsun with the PUA code point coverage while it has a newer Chinese font - Microsoft YaHei - with the corresponding regular code point coverage. One can manually add Simsun, though. At the moment, Chrome OS does have a font covering them (MSung GB18030), but may not in the future.
@kenlunde any updates on gb18030 revisions? Something that can be tracked perhaps?
@annevk: I will ping my contact at CESI in China to get the current status of the GB 18030 revision.
@annevk: My contact at CESI told me that a draft of the GB 18030 is expected to be available sometime this year, and is expected to fix known issues, such as this one and the presence of PUA code points when a non-PUA code point is available.
the presence of PUA code points when a non-PUA code point is available.
Will this result in two different byte sequences (two-byte and four-byte) decoding to the same code point for some code points?
Will the PUA code points that previously had two-byte representations be left without a representation?
(I have doubts that changing what a legacy encoding means in terms of mapping to Unicode at this point is a net positive change even if well-intentioned.)
Sufficient time has passed that implementing GB 18030 in any encoding other than Unicode makes no sense. The main benefit of the GB 18030 revision is simply to remove the PUA requirement from the GB 18030 certification process. Font implementations that map from those 24 PUA code points, to be GB 18030-compliant, should already be double-mapping from the corresponding 24 non-PUA code points.
If the purpose is to simplify the Unicode subset support certification aspect of GB18030, why is the legacy encoding aspect being changed also?
@hsivonen: It is a bit premature to know exactly what changes to the legacy encoding will change in the forthcoming GB 18030 update.
Consider a couple prototypical examples from the 24 characters that currently map to PUA code points:
0xA6D9 currently maps to U+E78D, but the non-PUA equivalent is U+FE10. The GB 18030-2005 standard indicates that U+FE10 corresponds to 0x84318236.
0xFE51 currently maps to U+E816, but the non-PUA equivalent is U+20087. The GB 18030-2005 standard indicates that U+20087 corresponds to 0x95329031.
The mapping for one of the characters in GB 18030-2000 was changed in the 2005 update, which gives us a glimpse about what is likely to change in the forthcoming update:
0xA8BC originally mapped to U+E7C7, but the 2005 update changed the mapping to U+1E3F, which originally mapped from 0x8135F437; 0x8135F437 now maps to U+E7C7. Following this precedent, I would expect the two examples to change to the following:
0xA6D9 → U+FE10 0x84318236 → U+E78D
0xFE51 → U+20087 0x95329031 → U+E816
I prepared a complete gb-18030-pua-changes.txt datafile that indicates the PUA change that occurred in the 2005 update, and what we can expect for the forthcoming update for the 24 remaining PUA characters by applying the same pattern.
0x84318236 → U+E78D ... 0x95329031 → U+E816
It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.
I don't see how that could be a good thing for any practical interop purpose. (I can see how that could seem appealing to the theory that the GB18030 encoding is a bijective UTF, but that's already not the case as far as the Web is concerned due to U+3000 being double-mapped and U+E5E5 being unmappable.)
Right. I was merely showing one possible way in which China may change GB 18030 to remove the PUA requirement, by applying the pattern that was used in the 2005 update. The single mapping change in the 2005 update may have been one-off–ish enough that China figured it would be harmless, but 24 mapping changes may be a bit much to swallow at once.
The history of GB 18030 goes back to GBK, which included significantly more PUA mappings, a little over 100. The ones that could be changed to non-PUA mappings were changed, and only 25 remained in GB 18030-2000, in terms of the "required" portion.
The other way to remove the PUA requirement, while keeping the mapping stable, is to first remove the requirement to support the following 24 characters:
0xA6D9 -> U+E78D
0xA6DA -> U+E78E
0xA6DB -> U+E78F
0xA6DC -> U+E790
0xA6DD -> U+E791
0xA6DE -> U+E792
0xA6DF -> U+E793
0xA6EC -> U+E794
0xA6ED -> U+E795
0xA6F3 -> U+E796
0xFE51 -> U+E816
0xFE52 -> U+E817
0xFE53 -> U+E818
0xFE59 -> U+E81E
0xFE61 -> U+E826
0xFE66 -> U+E82B
0xFE67 -> U+E82C
0xFE6C -> U+E831
0xFE6D -> U+E832
0xFE76 -> U+E83B
0xFE7E -> U+E843
0xFE90 -> U+E854
0xFE91 -> U+E855
0xFEA0 -> U+E864
And second, to require the following 24 characters:
0x82359037 -> U+9FB4
0x82359038 -> U+9FB5
0x82359039 -> U+9FB6
0x82359130 -> U+9FB7
0x82359131 -> U+9FB8
0x82359132 -> U+9FB9
0x82359133 -> U+9FBA
0x82359134 -> U+9FBB
0x84318236 -> U+FE10
0x84318237 -> U+FE11
0x84318238 -> U+FE12
0x84318239 -> U+FE13
0x84318330 -> U+FE14
0x84318331 -> U+FE15
0x84318332 -> U+FE16
0x84318333 -> U+FE17
0x84318334 -> U+FE18
0x84318335 -> U+FE19
0x95329031 -> U+20087
0x95329033 -> U+20089
0x95329730 -> U+200CC
0x9536B937 -> U+215D7
0x9630BA35 -> U+2298F
0x9635B630 -> U+241FE
My guess is that the original 24 characters, in terms of supporting their mappings, will be changed from "required" to "optional," and that the additional 24 characters will be changed from "optional" to "required" if the original 24 characters are not supported. Or, something to that effect.