GB2312's em dash

Open Artoria2e5 opened this issue 9 years ago • 4 comments

bsdconv's GB2312 table comes from unicode.org, where it went missing after the EASTASIA charts became obsolete. In quality it is, to some extent, similar to Unicode's Big5 table. (I will use unicode.org's hex values to refer to GB code points, so add 0x8080 to get EUC-CN.)

In GB2312-1980, 212A is defined as 破折号 (em dash), but the Unicode mapping gives U+2015 (horizontal bar) instead of U+2014, apparently without reading the Chinese text at all. Hence the GB2312 decoder should be changed to emit U+2014, just for proper punctuation; the encoder should be made to accept U+2014 too.

By the way, 212A is one of "Unicode" gb2312-80's incompatibilities with GBK; the other one is at 2124. You may choose to use a non-fullwidth, regular "middle dot" as GBK does and W3C CLREQ recommends typographically, but what I hope for now is just the encoder accepting U+00B7.
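To make the request concrete, here is a minimal Python sketch of the two overrides. It is not bsdconv code: the table and function names are hypothetical, and the U+30FB value for 2124 is what I recall from unicode.org's GB2312.TXT, so treat it as an assumption. It only covers the two code points discussed here.

```python
# Sketch only: names are hypothetical and do not reflect bsdconv's module layout.

EUC_CN_OFFSET = 0x8080  # add to a unicode.org-style GB code point to get EUC-CN

# The two entries under discussion, as recalled from unicode.org's GB2312.TXT
# (assumption: 2124 maps to U+30FB there).
UNICODE_ORG_TABLE = {
    0x2124: 0x30FB,  # KATAKANA MIDDLE DOT
    0x212A: 0x2015,  # HORIZONTAL BAR
}

# Proposed decoder: same as above, except 212A now decodes to EM DASH.
DECODER_TABLE = {**UNICODE_ORG_TABLE, 0x212A: 0x2014}

# Proposed encoder: invert the decoder table, then accept the extra inputs.
ENCODER_TABLE = {cp: gb for gb, cp in DECODER_TABLE.items()}
ENCODER_TABLE.update({
    0x2015: 0x212A,  # keep accepting HORIZONTAL BAR from the old mapping
    0x00B7: 0x2124,  # also accept the regular MIDDLE DOT that GBK uses
})


def to_euc_cn(gb: int) -> bytes:
    """Convert a unicode.org-style GB code point to its two EUC-CN bytes."""
    return (gb + EUC_CN_OFFSET).to_bytes(2, "big")


def decode(gb: int) -> str:
    """Decode one GB code point using the proposed decoder table."""
    return chr(DECODER_TABLE[gb])


def encode(ch: str) -> bytes:
    """Encode one character to EUC-CN using the proposed encoder table."""
    return to_euc_cn(ENCODER_TABLE[ord(ch)])


if __name__ == "__main__":
    for ch in "\u2014\u2015\u00b7\u30fb":
        print(f"U+{ord(ch):04X} -> {encode(ch).hex()}")
```

With these tables, both dashes encode to the same EUC-CN bytes and both middle dots do as well, while decoding 2124 is left unchanged.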

Artoria2e5, Dec 01 '16 17:12

Please feel free to change anything about Simplified Chinese. I am not a native user of it, and the current state is just enough for my previous use cases.

buganini, Dec 01 '16 18:12

Sure.

Artoria2e5, Dec 01 '16 18:12

Wait... With #17 how did it even work...

Artoria2e5, Dec 01 '16 18:12

You can add or rewrite encoders/decoders, and/or replace or add aliases.

Aliases are defined in https://github.com/buganini/bsdconv/blob/master/modules/from/alias and https://github.com/buganini/bsdconv/blob/master/modules/to/alias

After changing the alias files, running make alias will update:
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-FROM.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-INTER.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-TO.txt
https://github.com/buganini/bsdconv/blob/master/modules/inter/ALIAS-FILTER.txt

Big5 uses UAO250 as its default decoder and CP950 as its default encoder to achieve maximum compatibility in practical use.

buganini, Dec 01 '16 18:12