
If gb18030 is revised, consider aligning the Encoding Standard

Open lygstate opened this issue 9 years ago • 44 comments

GB18030-2005 is already a one-to-one mapping between Unicode and GB18030, except for the 14 characters that were still mapped to the Unicode PUA as of 2005. Nowadays all 14 of those characters have corresponding assigned Unicode code points, so I suggest that the Encoding Standard map those characters to regular Unicode characters rather than PUA characters.

The following 80 characters are the GBK characters that were once mapped to the Unicode PUA, together with the corresponding non-PUA Unicode characters:

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE50                E815                2E81
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE54                E819                2E84
                FE55                E81A                3473
                FE56                E81B                3447
                FE57                E81C                2E88
                FE58                E81D                2E8B
                FE59                E81E                9FB4
                FE5A                E81F                359E
                FE5B                E820                361A
                FE5C                E821                360E
                FE5D                E822                2E8C
                FE5E                E823                2E97
                FE5F                E824                396E
                FE60                E825                3918
                FE61                E826                9FB5
                FE62                E827                39CF
                FE63                E828                39DF
                FE64                E829                3A73
                FE65                E82A                39D0
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE68                E82D                3B4E
                FE69                E82E                3C6E
                FE6A                E82F                3CE0
                FE6B                E830                2EA7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE6E                E833                2EAA
                FE6F                E834                4056
                FE70                E835                415F
                FE71                E836                2EAE
                FE72                E837                4337
                FE73                E838                2EB3
                FE74                E839                2EB6
                FE75                E83A                2EB7
                FE76                E83B                2298F
                FE77                E83C                43B1
                FE78                E83D                43AC
                FE79                E83E                2EBB
                FE7A                E83F                43DD
                FE7B                E840                44D6
                FE7C                E841                4661
                FE7D                E842                464C
                FE7E                E843                9FB9
                FE80                E844                4723
                FE81                E845                4729
                FE82                E846                477C
                FE83                E847                478D
                FE84                E848                2ECA
                FE85                E849                4947
                FE86                E84A                497A
                FE87                E84B                497D
                FE88                E84C                4982
                FE89                E84D                4983
                FE8A                E84E                4985
                FE8B                E84F                4986
                FE8C                E850                499F
                FE8D                E851                499B
                FE8E                E852                49B7
                FE8F                E853                49B6
                FE90                E854                9FBA
                FE91                E855                241FE
                FE92                E856                4CA3
                FE93                E857                4C9F
                FE94                E858                4CA0
                FE95                E859                4CA1
                FE96                E85A                4C77
                FE97                E85B                4CA2
                FE98                E85C                4D13
                FE99                E85D                4D14
                FE9A                E85E                4D15
                FE9B                E85F                4D16
                FE9C                E860                4D17
                FE9D                E861                4D18
                FE9E                E862                4D19
                FE9F                E863                4DAE
                FEA0                E864                9FBB

The following 14 characters are the GB18030-2005 characters that are still mapped to the Unicode PUA. I suggest that the Encoding Standard map those characters to non-PUA Unicode, because we have no need to wait for GB18030 to update its spec just for those 14 characters, and we can be sure that the corresponding non-PUA Unicode characters for those 14 characters are settled.

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE59                E81E                9FB4
                FE61                E826                9FB5
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE76                E83B                2298F
                FE7E                E843                9FB9
                FE90                E854                9FBA
                FE91                E855                241FE
                FEA0                E864                9FBB

With these tables, we can decode all strings in the GBK encoding family to non-PUA Unicode. Besides that, we still need to convert all historical Unicode PUA characters to the proper GBK (GB18030) characters.
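As a rough illustration of the proposal above, the 14-row table can be applied as a post-processing step on text that has already been decoded. This is only a sketch: the names `PUA_TO_STANDARD` and `replace_legacy_pua` are hypothetical, and the table is copied from the 14 rows listed in this comment, not from any normative source.

```python
# Hypothetical table built from the 14 rows listed above:
# legacy PUA code point -> assigned (non-PUA) Unicode code point.
PUA_TO_STANDARD = {
    0xE816: 0x20087, 0xE817: 0x20089, 0xE818: 0x200CC, 0xE81E: 0x9FB4,
    0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7, 0xE831: 0x215D7,
    0xE832: 0x9FB8, 0xE83B: 0x2298F, 0xE843: 0x9FB9, 0xE854: 0x9FBA,
    0xE855: 0x241FE, 0xE864: 0x9FBB,
}


def replace_legacy_pua(text: str) -> str:
    """Replace the 14 legacy PUA code points with their assigned equivalents.

    All other characters, including unrelated PUA code points, pass through
    unchanged.
    """
    return text.translate(PUA_TO_STANDARD)


if __name__ == "__main__":
    sample = "a\ue816b"
    print([f"U+{ord(ch):04X}" for ch in replace_legacy_pua(sample)])
```

Keeping the replacement outside the decoder leaves the byte-level mapping untouched, which is one way to experiment with the idea without changing what an existing decoder emits.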

lygstate avatar Jan 17 '16 16:01 lygstate

I disagree. We shouldn't invent yet another new encoding anymore.

vyv03354 avatar Jan 17 '16 16:01 vyv03354

I tend to agree with @vyv03354. Since no implementation does this and developers are asked to use utf-8, I don't really see an upside here. This only increases the chance that things break.

annevk avatar Jan 17 '16 18:01 annevk

@vyv03354 @annevk We are not inventing a new encoding, just getting an existing encoding to work.

lygstate avatar Jan 17 '16 19:01 lygstate

Fair, changing an encoding is not inventing a new one. However, it is not clear why we should change it, since implementations mostly agree here.

annevk avatar Jan 18 '16 08:01 annevk

@annevk @vyv03354 Please consider the following situation: suppose I have text containing the Unicode character U+20087. When I convert this character to GBK, what should I do? Emit 0xFE51, or treat it as an invalid character? So we are just refining the existing conversion table to its final state.

lygstate avatar Jan 19 '16 12:01 lygstate

when I convert this character to GBK,

We don't convert any plane-2 characters in the GBK encoder. It will be changed to a numeric character reference (&#131207; for 𠂇).
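The fallback described here can be reproduced with Python's built-in `gbk` codec, which likewise has no mapping for plane-2 characters. This is only an analogy: Python's codec is not the Encoding Standard's encoder, and the `xmlcharrefreplace` error handler stands in for the character-reference behaviour mentioned above.

```python
# U+20087 lies outside the BMP, so Python's "gbk" codec cannot encode it.
# With errors="xmlcharrefreplace" the character is replaced by a decimal
# numeric character reference instead of raising UnicodeEncodeError.
encoded = "\U00020087".encode("gbk", errors="xmlcharrefreplace")
print(encoded)
```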

Japanese users have suffered from encoding "improvements" in JIS standards and industry de facto standards. In ISO coded character set standards, even a single character change is considered a new encoding. Such a change would do more harm than good, even if it is made with good intentions.

vyv03354 avatar Jan 19 '16 12:01 vyv03354

@vyv03354 That's really different, because JIS doesn't map any characters to PUA Unicode characters. GBK did so only because at that time Unicode didn't have enough characters assigned for GBK, but now it has. That's totally different.

lygstate avatar Jan 19 '16 13:01 lygstate

@lygstate as with the other issue, I recommend using utf-8 instead. I agree with @vyv03354 that changing implementations at this point is more likely to lead to breakage than happy users.

annevk avatar Jan 20 '16 15:01 annevk

Will you add something like a line of "note" to the description for gb18030 in the spec mentioning this issue? PUA really brings a lot of issues to users as using its codepoints without a common agreement is like inventing a nationwide Unicode dialect.

To be frank I would rather leave the dialect pollution in the legacy encoder/decoder bridge than let it spread in the new world, so please consider adding:

  • a flag that instructs the decoder to not emit PUA
  • a flag that instructs the encoder to warn against PUA usage potentially resulting from GB18030-200{0,5} decoding

and as a basis for these changes,

  • a mapping from "old world" PUAs to "new world" Unicode CJK Extensions.
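The encoder-side warning suggested above could look roughly like the following. This is a sketch under stated assumptions: `find_bmp_pua` is a hypothetical helper, not part of any existing encoder API, and it flags every code point in the BMP Private Use Area (U+E000 to U+F8FF) without checking whether it came from GB18030 decoding.

```python
import warnings


def find_bmp_pua(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every BMP Private Use Area character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if 0xE000 <= ord(ch) <= 0xF8FF
    ]


def warn_on_pua(text: str) -> str:
    """Emit a warning if the text contains PUA code points, then return it."""
    hits = find_bmp_pua(text)
    if hits:
        warnings.warn(f"text contains PUA code points: {hits}")
    return text


if __name__ == "__main__":
    print(find_bmp_pua("a\ue816b"))
```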

See also:

  • https://blogs.adobe.com/CCJKType/2015/03/to-gb18030.html
  • http://www.unicode.org/L2/L2006/06394-gb18030-2005.txt
  • https://en.wikipedia.org/wiki/Talk%3AGB_18030#The_need_for_a_new_mapping_table
  • https://ssl.icu-project.org/docs/papers/unicode-gb18030-faq.html

@lygstate Could you please consider reopening this issue if you find my — um — attempt helpful?

gbk-gb18030-pua.txt

Artoria2e5 avatar Sep 05 '16 19:09 Artoria2e5

@Artoria2e5 the "new world" should use utf-8 exclusively.

annevk avatar Sep 06 '16 08:09 annevk

@annevk But we still need a way to migrade from the old world.

lygstate avatar Sep 06 '16 13:09 lygstate

@annevk It's true that the modern world should use UTF-8 for information exchange, processing and storage. But given that character representations in UTF-8 rely on codepoints assigned in Unicode, it makes sense to use the formal, universal codepoint assignments in this universal encoding.

As stated previously, by emitting PUA codepoints in the decoder, you are speaking in a Unicode dialect codepoint-wise, resulting in a less interchangeable UTF-8 variant, thus contradicting the point of using UTF-8 everywhere. (The use of PUA here cannot be justified by a lack of definition as these ideographs do have formal assignments.) The encoder part is more about discouraging old PUA usage.

But we still need a way to migra[t]e from the old world.

And we need to make sure that the way gives us actual "new world" stuff.


By the way, there should be 24 PUA codepoints in the 2005 standard instead of 14, according to the L2/06-394 "Update on GB 18030:2005" by Ken Lunde.


An interesting but sad example of this dialect split can be shown using the character U+20087 (𠂇), assigned to PUA codepoint U+E816 in the mapping. Search engines like Google won't do normalization on PUA forms where several different sets of agreements exist, and you can see it from the search results.

Artoria2e5 avatar Sep 06 '16 15:09 Artoria2e5

Given that no browser implements gb18030 like that I don't see why we should change this. We could easily break those relying on these bytes mapping to PUA. I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion and those maintaining gb18030 have not decided to care.

annevk avatar Sep 06 '16 16:09 annevk

The GB18030 mapping is naturally fungible wrt PUA characters, since Unicode continues to encode Chinese code points. I think this should be recognized by Encoding.

I agree that we should not remove mapping of Unicode PUA -> GB18030 (compatibility). But the problem here is round-tripping of real Unicode code points with GB18030.

If I have a U+20087, convert it to GB18030, and then later reserialize the GB data as UTF-8, I will get back U+E816 rather than the original (and correct) code point. That's undesirable and a loss of information. The fact that existing implementations haven't caught up with standardization doesn't mean that we shouldn't make this change.

@annevk Under what circumstances would we change? One of the problems with establishing a standard is that implementations are trying hard to be compliant with it...

aphillips avatar Sep 06 '16 17:09 aphillips

@aphillips what standard are we talking about? The standard for gb18030 has that loss of information and Encoding doesn't modify it (it does modify some other parts).

annevk avatar Sep 06 '16 17:09 annevk

Given that no browser implements gb18030 like that I don't see why we should change this.

Newer Pan-CJK font families like Adobe's Source Han Sans (led by @kenlunde) decided to go with Unicode instead of GB 18030-flavored Unicode.

I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion

Dr. Ken "Someone" Lunde (again!) is among the editors of UAX 38 Unihan database, and has participated very extensively in many CJK-related standardization processes in Unicode.

and those maintaining gb18030 have not decided to care.

The Chinese SAC has decided not to care about a lot of things including their translations of ANSI C (GB/T 15272:1994, ISO/IEC 9899:1990) and UCS (GB 13000:2010, ISO/IEC 10646:2003). But this lag doesn't mean that the Chinese are not using newer revisions of the C language and Unicode. The same should apply to the UCS references in GB 18030:2005.


2016-09-12: Found out that W3C (well, that sounds impractical) has some rules regarding using PUA in i18n specs.

Artoria2e5 avatar Sep 06 '16 17:09 Artoria2e5

@Artoria2e5: The reasons why Source Han Sans (and the Google-branded Noto Sans CJK) does not support the 24 PUA code points of GB 18030 are because 1) PUA code points should be avoided in general; 2) PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts; 3) a GB 18030 revision is expected to be published soon that will specify the non-PUA code points for these 24 characters, which will effectively lift the PUA requirement; 4) the 24 characters have had non-PUA code points for over a decade; and 5) the "release" branch of Source Han Sans includes a utf32-gb18030pua24.map file that provides the 24 PUA mappings for those developers who need support for these PUA code points.

kenlunde avatar Sep 06 '16 17:09 kenlunde

PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts;

Hmm, I guess that an encoding spec for dealing with legacy encodings also falls into the scope of "mixing multiple standards". It looks like reasons 1–4 are on my side...

Artoria2e5 avatar Sep 06 '16 17:09 Artoria2e5

I guess if gb18030 is actually revised there is a chance web-focused implementations might want to change their mapping. If that happens and implementations indeed want to make a backwards incompatible change someone should raise a new issue.

annevk avatar Sep 07 '16 07:09 annevk

This issue was raised by me last year-early this year in #22 (and https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 ).

As I wrote there, the current mapping makes it impossible to display those characters involved [1] on some platforms (Android and Windows 10 [2]) when they're encoded in GB 18030 because there is NO font covering the corresponding PUA code points. This is one of the most serious consequences of the current mapping to me (besides other consequences mentioned earlier).

OTOH, if there are multiple fonts covering those PUA points with different interpretations, there's no easy way to pick the right one (if the only information at hand is code points) because the identity of a PUA code point is up to private parties and is non-deterministic by definition.

( Needless to say, there'd be no such problem if UTF-8 is used with regular code points and we want everybody to use UTF-8 on the web. )

Given all these, removing any mapping to PUA code points (as long as there are regular Unicode characters) is desired.

As mentioned in #22, I initially thought that GB18030:2005 had fixed all these up (by 2005, all the characters originally mapped to PUA code points had been encoded in Unicode) in a way similar to what's done for HKSCS. It turned out that that was not the case, which was rather disappointing. As a result (and because the 24 characters affected are rarely used, especially the U+FE1x ones), the change for #22 was minimal (only one code point was fixed per GB18030:2005).

Given that GB18030 will be revised soon (per @kenlunde) to eliminate the canonical mapping to PUA code points, Chromium is more than willing to go ahead with mapping the 24 byte-sequences in GB18030 to regular Unicode characters.

[1] In addition to the 14 CJK ideographs/radicals listed earlier, there are vertical form variants that are still mapped to PUA code points. (Well, U+FE1x will be virtually unused in gb18030-encoded documents.)

Bytes           Unicode PUA       Unicode non-PUA
\xA6\xD9        U+E78D            U+0fe10
\xA6\xDA        U+E78E            U+0fe12
\xA6\xDB        U+E78F            U+0fe11
\xA6\xDC        U+E790            U+0fe13
\xA6\xDD        U+E791            U+0fe14
\xA6\xDE        U+E792            U+0fe15
\xA6\xDF        U+E793            U+0fe16
\xA6\xEC        U+E794            U+0fe17
\xA6\xED        U+E795            U+0fe18
\xA6\xF3        U+E796            U+0fe19
\xFE\x51        U+E816            U+20087
\xFE\x52        U+E817            U+20089
\xFE\x53        U+E818            U+200cc
\xFE\x59        U+E81E            U+09fb4
\xFE\x61        U+E826            U+09fb5
\xFE\x66        U+E82B            U+09fb6
\xFE\x67        U+E82C            U+09fb7
\xFE\x6C        U+E831            U+215d7
\xFE\x6D        U+E832            U+09fb8
\xFE\x76        U+E83B            U+2298f
\xFE\x7E        U+E843            U+09fb9
\xFE\x90        U+E854            U+09fba
\xFE\x91        U+E855            U+241fe
\xFE\xA0        U+E864            U+09fbb

[2] Android (at least Google's Nexus devices) does not have any font covering the PUA code points listed in [1]. Out of the box (perhaps unless your UI language is Simplified Chinese), Windows 10 does not have Simsun with the PUA code point coverage while it has a newer Chinese font - Microsoft YaHei - with the corresponding regular code point coverage. One can manually add Simsun, though. At the moment, Chrome OS does have a font covering them (MSung GB18030), but may not in the future.

jungshik avatar Sep 10 '16 10:09 jungshik

@kenlunde any updates on gb18030 revisions? Something that can be tracked perhaps?

annevk avatar Mar 19 '17 14:03 annevk

@annevk: I will ping my contact at CESI in China to get the current status of the GB 18030 revision.

kenlunde avatar Mar 19 '17 14:03 kenlunde

@annevk: My contact at CESI told me that a draft of the GB 18030 revision is expected to be available sometime this year, and is expected to fix known issues, such as this one and the presence of PUA code points when a non-PUA code point is available.

kenlunde avatar Mar 20 '17 12:03 kenlunde

the presence of PUA code points when a non-PUA code point is available.

Will this result in two different byte sequences (two-byte and four-byte) decoding to the same code point for some code points?

Will the PUA code points that previously had two-byte representations be left without a representation?

(I have doubts that changing what a legacy encoding means in terms of mapping to Unicode at this point is a net positive change even if well-intentioned.)

hsivonen avatar Mar 20 '17 12:03 hsivonen

Sufficient time has passed that to implement GB 18030 in any encoding other than Unicode makes no sense. The main benefit of the GB 18030 revision is to simply remove the PUA requirement from the GB 18030 certification process. Font implementations that map from those 24 PUA code points, to be GB 18030–compliant, should already be double-mapping from the corresponding 24 non-PUA code points.

kenlunde avatar Mar 20 '17 15:03 kenlunde

If the purpose is to simplify the Unicode subset support certification aspect of GB18030, why is the legacy encoding aspect being changed also?

hsivonen avatar Mar 20 '17 18:03 hsivonen

@hsivonen: It is a bit premature to know exactly what changes will be made to the legacy encoding in the forthcoming GB 18030 update.

Consider a couple prototypical examples from the 24 characters that currently map to PUA code points:

0xA6D9 currently maps to U+E78D, but the non-PUA equivalent is U+FE10. The GB 18030-2005 standard indicates that U+FE10 corresponds to 0x84318236.

0xFE51 currently maps to U+E816, but the non-PUA equivalent is U+20087. The GB 18030-2005 standard indicates that U+20087 corresponds to 0x95329031.
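The supplementary-plane correspondence quoted above (U+20087 and 0x95329031) can be checked arithmetically, because GB 18030 four-byte sequences map linearly onto U+10000 and above. The sketch below assumes a well-formed four-byte sequence and only handles that linear range; BMP four-byte sequences such as 0x84318236 need a range table and are not covered. The function name is hypothetical.

```python
def four_byte_to_supplementary(seq: bytes) -> int:
    """Map a GB 18030 four-byte sequence to a supplementary-plane code point.

    Only the linear range starting at 0x90308130 (U+10000) is handled;
    anything else raises ValueError.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    # Linear index over all four-byte sequences, starting at 0x81308130.
    pointer = (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30)
    # Pointer 189000 corresponds to U+10000; 1237575 corresponds to U+10FFFF.
    if not 189000 <= pointer <= 1237575:
        raise ValueError("sequence is outside the supplementary-plane range")
    return 0x10000 + pointer - 189000


if __name__ == "__main__":
    code_point = four_byte_to_supplementary(bytes.fromhex("95329031"))
    print(f"U+{code_point:04X}")
```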

The mapping for one of the characters in GB 18030-2000 was changed in the 2005 update, which gives us a glimpse about what is likely to change in the forthcoming update:

0xA8BC originally mapped to U+E7C7, but the 2005 update changed the mapping to U+1E3F, which originally mapped from 0x8135F437. 0x8135F437 now maps to U+E7C7. Following this precedent, I would expect the two examples to be changed to the following:

0xA6D9 → U+FE10
0x84318236 → U+E78D

0xFE51 → U+20087
0x95329031 → U+E816
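The swap pattern described above can be sketched as an exchange of targets between a two-byte sequence and its four-byte counterpart in a decode table. The table below is a hypothetical four-entry excerpt built from the examples in this comment, and `swap_mappings` is an illustrative helper, not a description of what the revised standard will actually do.

```python
def swap_mappings(decode_table: dict[int, int], two_byte: int, four_byte: int) -> dict[int, int]:
    """Return a copy of decode_table with the two sequences' targets exchanged."""
    updated = dict(decode_table)
    updated[two_byte] = decode_table[four_byte]
    updated[four_byte] = decode_table[two_byte]
    return updated


# Current mappings for the two examples (byte sequence -> code point).
current = {
    0xA6D9: 0xE78D, 0x84318236: 0xFE10,
    0xFE51: 0xE816, 0x95329031: 0x20087,
}

expected = swap_mappings(current, 0xA6D9, 0x84318236)
expected = swap_mappings(expected, 0xFE51, 0x95329031)

if __name__ == "__main__":
    for seq, code_point in expected.items():
        print(f"0x{seq:X} -> U+{code_point:04X}")
```

Because the swap keeps the table a bijection, every code point stays encodable, but four-byte data that used to decode to an assigned code point would decode to the PUA afterwards.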

kenlunde avatar Mar 20 '17 19:03 kenlunde

I prepared a complete gb-18030-pua-changes.txt datafile that indicates the PUA change that occurred in the 2005 update, and what we can expect for the forthcoming update for the 24 remaining PUA characters by applying the same pattern.

kenlunde avatar Mar 20 '17 21:03 kenlunde

0x84318236 → U+E78D ... 0x95329031 → U+E816

It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.

I don't see how that could be a good thing for any practical interop purpose. (I can see how that could seem appealing to the theory that the GB18030 encoding is a bijective UTF, but that's already not the case as far as the Web is concerned due to U+3000 being double-mapped and U+E5E5 being unmappable.)

hsivonen avatar Mar 21 '17 07:03 hsivonen

Right. I was merely showing one possible way in which China may change GB 18030 to remove the PUA requirement, by applying the pattern that was used in the 2005 update. The single mapping change in the 2005 update may have been one-off–ish enough that China figured it would be harmless, but 24 mapping changes may be a bit much to swallow at once.

The history of GB 18030 goes back to GBK, which included significantly more PUA mappings, a little over 100. The ones that could be changed to non-PUA mappings were changed, and only 25 remained in GB 18030-2000, in terms of the "required" portion.

The other way to remove the PUA requirement while keeping the mapping stable is to first remove the requirement to support the following 24 characters:

0xA6D9 -> U+E78D
0xA6DA -> U+E78E
0xA6DB -> U+E78F
0xA6DC -> U+E790
0xA6DD -> U+E791
0xA6DE -> U+E792
0xA6DF -> U+E793
0xA6EC -> U+E794
0xA6ED -> U+E795
0xA6F3 -> U+E796
0xFE51 -> U+E816
0xFE52 -> U+E817
0xFE53 -> U+E818
0xFE59 -> U+E81E
0xFE61 -> U+E826
0xFE66 -> U+E82B
0xFE67 -> U+E82C
0xFE6C -> U+E831
0xFE6D -> U+E832
0xFE76 -> U+E83B
0xFE7E -> U+E843
0xFE90 -> U+E854
0xFE91 -> U+E855
0xFEA0 -> U+E864

And second, to require the following 24 characters:

0x82359037 -> U+9FB4
0x82359038 -> U+9FB5
0x82359039 -> U+9FB6
0x82359130 -> U+9FB7
0x82359131 -> U+9FB8
0x82359132 -> U+9FB9
0x82359133 -> U+9FBA
0x82359134 -> U+9FBB
0x84318236 -> U+FE10
0x84318237 -> U+FE11
0x84318238 -> U+FE12
0x84318239 -> U+FE13
0x84318330 -> U+FE14
0x84318331 -> U+FE15
0x84318332 -> U+FE16
0x84318333 -> U+FE17
0x84318334 -> U+FE18
0x84318335 -> U+FE19
0x95329031 -> U+20087
0x95329033 -> U+20089
0x95329730 -> U+200CC
0x9536B937 -> U+215D7
0x9630BA35 -> U+2298F
0x9635B630 -> U+241FE

My guess is that the original 24 characters, in terms of supporting their mappings, will be changed from "required" to "optional," and that the additional 24 characters will be changed from "optional" to "required" if the original 24 characters are not supported. Or, something to that effect.

kenlunde avatar Mar 21 '17 12:03 kenlunde