iconv-lite
U+301C and U+FF5E are not correctly mapped in EUC-JP/Shift_JIS/CP932
WAVE DASH (U+301C, 〜) and FULLWIDTH TILDE (U+FF5E, ～) have almost the same glyph but different code points. WAVE DASH (1-33 in JIS X 0208) should be mapped to U+301C, but iconv-lite maps it to U+FF5E. The mappings are incorrect in EUC-JP, Shift_JIS, and CP932.
Convert with iconv-lite
Unicode | -> | EUC-JP | -> | Unicode |
---|---|---|---|---|
U+301C | -> | 3F (no map) | | |
U+FF5E | -> | 8F A2 B7 | -> | U+FF5E |
| | | A1 C1 | -> | U+FF5E |
Unicode | -> | Shift_JIS/CP932 | -> | Unicode |
---|---|---|---|---|
U+301C | -> | 3F (no map) | | |
U+FF5E | -> | 81 60 | -> | U+FF5E |
Convert with libiconv
Unicode | -> | EUC-JP | -> | Unicode |
---|---|---|---|---|
U+301C | -> | A1 C1 | -> | U+301C |
U+FF5E | -> | 8F A2 B7 | -> | U+FF5E |
Unicode | -> | Shift_JIS | -> | Unicode |
---|---|---|---|---|
U+301C | -> | 81 60 | -> | U+301C |
U+FF5E | -> | (no map) |
Unicode | -> | CP932 | -> | Unicode |
---|---|---|---|---|
U+301C | -> | 81 60 | -> | U+301C |
U+FF5E | -> | 81 60 |
Hey raccy, thanks for filing this issue.
In multibyte encodings, iconv-lite tries its best to mirror the WHATWG Encoding Standard. I just checked, and the standard maps symbol 1-33 to U+FF5E, see this and this.
Do you have sources other than libiconv that map 1-33 to U+301C? You might want to file an issue on the Encoding Standard issue tracker. I see there's some minor discussion there about it.
I can probably add the encoding pair U+301C -> 81 60 for Shift_JIS and CP932 to be more flexible, but for the decoding part I currently aim to follow the Encoding Standard.
What do you think?
Hi ashtuchkin,
raccy is right. U+FF5E is the mapping used by the Microsoft code page (cp932), which is not authorized by a public standards body. U+301C is the mapping according to the Japanese Industrial Standard (JIS X 0208).
- Shift_JIS should preferably conform to JIS X 0208: the detailed encoding scheme is defined in Annex 1 of that standard.
- EUC-JP should preferably conform to eucjp-ascii as defined by OSF/JVC. Although it is not a national standard, it is identical to x-eucjp-open-19970715-ascii listed in the XML Japanese Profile.
More characters also have incompatible mappings across the two mappings above. It is quite a mess for Japanese users. If you prefer, I'd like to provide changes.
Thanks for chiming in, Ikedas. What do you think of the discussion of the same issue at the encoding standard tracker: https://github.com/whatwg/encoding/issues/47 ?
Note to self: Ambiguities can be seen here: https://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/#ambiguity_of_yen
takahashim's suggestion looks reasonable to me. The current index-jis0208.txt would be renamed to index-windows31j.txt or similar, and appropriate names would be assigned to the respective mappings.
(The problem of indices beyond 8836 (94 × 94) is a separate matter. They are simply outside the domain defined for the coded character set by ISO/IEC, i.e. they belong to the domain of vendor extensions.)
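To make the 94 × 94 boundary concrete, here is a minimal sketch of the row-cell and index arithmetic. The function names are hypothetical; the pointer formulas follow the decoder arithmetic described in the Encoding Standard, and trail-byte validation is left out for brevity.

```javascript
// Size of the 94 x 94 plane defined by JIS X 0208.
const PLANE_SIZE = 94 * 94; // 8836

// EUC-JP two-byte sequence -> row (ku), cell (ten) and index pointer.
function eucJpToKuten(lead, trail) {
  return {
    row: lead - 0xA0,   // 1-based row
    cell: trail - 0xA0, // 1-based cell
    pointer: (lead - 0xA1) * 94 + (trail - 0xA1),
  };
}

// Shift_JIS two-byte sequence -> row, cell and index pointer.
function shiftJisToKuten(lead, trail) {
  const leadOffset = lead < 0xA0 ? 0x81 : 0xC1;
  const trailOffset = trail < 0x7F ? 0x40 : 0x41;
  const pointer = (lead - leadOffset) * 188 + (trail - trailOffset);
  return {
    row: Math.floor(pointer / 94) + 1,
    cell: (pointer % 94) + 1,
    pointer,
  };
}

// True when the pointer falls inside the 94 x 94 plane.
const inStandardPlane = pointer => pointer >= 0 && pointer < PLANE_SIZE;

console.log(eucJpToKuten(0xA1, 0xC1));    // row 1, cell 33, pointer 32
console.log(shiftJisToKuten(0x81, 0x60)); // row 1, cell 33, pointer 32
console.log(shiftJisToKuten(0xFA, 0x40).pointer); // 10716, beyond 8836
```

Both byte sequences for 1-33 land on the same pointer, while a Shift_JIS lead byte of 0xFA already yields a pointer outside the plane.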
On ambiguity, several implementations add one-way (Unicode to legacy) mappings for non-standard mappings, e.g. U+2015 HORIZONTAL BAR to \xA1\xBD EM DASH, so that round-trip conversion between cp932-based and JIS-based mappings is more or less preserved.
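The one-way idea can be sketched as follows. This is an illustration under assumptions, not how any particular library is implemented: the table names are hypothetical and only two EUC-JP code points are listed.

```javascript
// Forward (legacy -> Unicode) table: one entry per legacy code point.
const decodeTable = new Map([
  ['A1BD', '\u2014'], // EM DASH (JIS-based mapping)
  ['A1C1', '\u301C'], // WAVE DASH (JIS-based mapping)
]);

// Reverse-only additions: characters a cp932-based decoder would produce.
const reverseOnly = new Map([
  ['\u2015', 'A1BD'], // HORIZONTAL BAR
  ['\uFF5E', 'A1C1'], // FULLWIDTH TILDE
]);

// Encode table = one-way entries plus the inverse of the decode table.
const encodeTable = new Map(reverseOnly);
for (const [hex, ch] of decodeTable) {
  encodeTable.set(ch, hex);
}

// Returns the EUC-JP bytes as a hex string, or null when unmapped.
function encodeChar(ch) {
  return encodeTable.has(ch) ? encodeTable.get(ch) : null;
}

// Returns the decoded character, or null when unmapped.
function decodeBytes(hex) {
  return decodeTable.has(hex) ? decodeTable.get(hex) : null;
}

console.log(encodeChar('\u301C')); // A1C1
console.log(encodeChar('\uFF5E')); // A1C1 (one-way mapping)
console.log(decodeBytes('A1C1') === '\u301C'); // true
```

Text decoded elsewhere with the cp932-based mapping can still be encoded here, while decoding always yields the JIS-based character.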
(Addition) As takahashim pointed out, the mapping defined by JIS X 0213 is rarely used in practice. It is an extension to JIS X 0208 but is not compatible with it.
Thank you for your reply, ashtuchkin.
I found these files.
- ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
- ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
These are in the OBSOLETE directory, but libiconv probably used these maps.
I don't object to iconv-lite being based on the WHATWG Encoding Standard, and I think that is a good policy. But I think that there are two problems.
- iconv-lite behaves differently from node-iconv, which is confusing. (See and run the code below.)
- JIS X 0208 takes precedence over JIS X 0212 (sequences beginning with 8F), and depending on the implementation, JIS X 0212 may not be supported in EUC-JP.
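As a rough illustration of the second point, the sketch below splits EUC-JP bytes by code set and prefers a JIS X 0208 sequence when a character has several candidate encodings. The function names are hypothetical, and trail bytes are not validated.

```javascript
// Split EUC-JP bytes into units tagged with the code set they belong to.
function splitEucJp(bytes) {
  const units = [];
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let set = 'invalid';
    let length = 1;
    if (b < 0x80) {
      set = 'ASCII';
    } else if (b === 0x8E) {
      set = 'JIS X 0201 katakana'; length = 2;
    } else if (b === 0x8F) {
      set = 'JIS X 0212'; length = 3;
    } else if (b >= 0xA1 && b <= 0xFE) {
      set = 'JIS X 0208'; length = 2;
    }
    if (i + length > bytes.length) { // truncated sequence
      set = 'invalid';
      length = bytes.length - i;
    }
    units.push({ set, bytes: Array.prototype.slice.call(bytes, i, i + length) });
    i += length;
  }
  return units;
}

// Among candidate encodings of one character, prefer one without the
// 8F (JIS X 0212) prefix; fall back to the first candidate otherwise.
function pickEncoding(candidates) {
  return candidates.find(c => c[0] !== 0x8F) || candidates[0] || null;
}

console.log(splitEucJp([0x41, 0xA1, 0xC1, 0x8F, 0xA2, 0xB7]).map(u => u.set));
// [ 'ASCII', 'JIS X 0208', 'JIS X 0212' ]
console.log(pickEncoding([[0x8F, 0xA2, 0xB7], [0xA1, 0xC1]])); // [ 161, 193 ]
```

With this preference, a character reachable through both 8F A2 B7 and A1 C1 would be encoded as A1 C1.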
This code checks encoding and decoding with iconv-lite and node-iconv.
```javascript
const Iconv = require('iconv').Iconv;
const lite = require('iconv-lite');

// Format the first UTF-16 code unit of a string as "U+XXXX".
const unicodePoint = s => 'U+' + s.charCodeAt(0).toString(16).toUpperCase();

// Format a buffer as a list of hex bytes, e.g. "[ A1 C1 ]".
const bufferString = buf => {
  let s = '[ ';
  for (const b of buf) {
    s += b.toString(16).toUpperCase().padStart(2, '0');
    s += ' ';
  }
  s += ']';
  return s;
};

// Print helpers: code point of a string, bytes of a buffer, and the
// code point of a UTF-8 buffer.
const up = (m, s) => console.log(m + ' : ' + unicodePoint(s));
const bp = (m, b) => console.log(m + ' -> ' + bufferString(b));
const ubp = (m, b) => up(m, b.toString('utf8'));

const waveDash = '\u301C';
const fwTilde = '\uFF5E';
// Buffer.from replaces the deprecated `new Buffer(array)` constructor.
const eucjp_A1C1 = Buffer.from([0xA1, 0xC1]); // JISX0208 1-33 on EUC-JP
const eucjp_8FA2B7 = Buffer.from([0x8F, 0xA2, 0xB7]); // JISX0212 2-23 on EUC-JP
const sjis_8160 = Buffer.from([0x81, 0x60]); // JISX0208 1-33 on Shift_JIS

console.log('---- Unicode ----');
up('WAVE DASH', waveDash);
up('FULLWIDTH TILDE', fwTilde);
console.log();

console.log('---- iconv-lite ----');
up('EUC-JP A1 C1', lite.decode(eucjp_A1C1, 'eucjp'));
up('EUC-JP 8F A2 B7', lite.decode(eucjp_8FA2B7, 'eucjp'));
bp('WAVE DASH to EUC-JP', lite.encode(waveDash, 'eucjp'));
bp('FULLWIDTH TILDE to EUC-JP', lite.encode(fwTilde, 'eucjp'));
console.log();
up('Shift_JIS 81 60', lite.decode(sjis_8160, 'shift_jis'));
bp('WAVE DASH to Shift_JIS', lite.encode(waveDash, 'shift_jis'));
bp('FULLWIDTH TILDE to Shift_JIS', lite.encode(fwTilde, 'shift_jis'));
console.log();
up('CP932 81 60', lite.decode(sjis_8160, 'cp932'));
bp('WAVE DASH to CP932', lite.encode(waveDash, 'cp932'));
bp('FULLWIDTH TILDE to CP932', lite.encode(fwTilde, 'cp932'));
console.log();

console.log('---- node-iconv ----');
const utf8_waveDash = Buffer.from(waveDash, 'utf8');
const utf8_fwTilde = Buffer.from(fwTilde, 'utf8');

const e2u_iconv = new Iconv('EUC-JP', 'UTF-8');
const u2e_iconv = new Iconv('UTF-8', 'EUC-JP');
ubp('EUC-JP A1 C1', e2u_iconv.convert(eucjp_A1C1));
ubp('EUC-JP 8F A2 B7', e2u_iconv.convert(eucjp_8FA2B7));
bp('WAVE DASH to EUC-JP', u2e_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to EUC-JP', u2e_iconv.convert(utf8_fwTilde));
console.log();

const s2u_iconv = new Iconv('Shift_JIS', 'UTF-8');
const u2s_iconv = new Iconv('UTF-8', 'Shift_JIS');
ubp('Shift_JIS 81 60', s2u_iconv.convert(sjis_8160));
bp('WAVE DASH to Shift_JIS', u2s_iconv.convert(utf8_waveDash));
try {
  // Error: Illegal character sequence
  bp('FULLWIDTH TILDE to Shift_JIS', u2s_iconv.convert(utf8_fwTilde));
} catch (e) {
  console.log('FULLWIDTH TILDE to Shift_JIS <ERROR> ' + e.message);
}
console.log();

const c2u_iconv = new Iconv('CP932', 'UTF-8');
const u2c_iconv = new Iconv('UTF-8', 'CP932');
ubp('CP932 81 60', c2u_iconv.convert(sjis_8160));
bp('WAVE DASH to CP932', u2c_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to CP932', u2c_iconv.convert(utf8_fwTilde));
```
Mappings on unicode.org may not be compatible with other implementations, e.g. 0x815C / 0x213D is mapped to U+2015 HORIZONTAL BAR. Personally, I believe the mapping defined by JIS (the only mapping publicly authorized by ISO/IEC 10646) should be the reference; however, investigating existing implementations is useful.
I suggest that at least the 10 mappings mentioned above be checked (both forward and reverse mappings) to compare implementations. Additionally, duplicate mappings such as U+2116 NUMERO SIGN (both JIS X 0208 and JIS X 0212 have it) should be considered.
I compiled tables to help compare implementations.
- IMO, “Canonic” in the tables below would provide bi-directional conversion (from and to Unicode), while the others would provide only reverse (from Unicode) or forward (to Unicode) conversion.
- Note that the tables below focus on EUC-JP implementations. They are not necessarily applicable to Shift_JIS / cp932.
The following table shows vendor-dependent mappings. That is, depending on the implementation, a single code point in the legacy character set can be mapped to different Unicode characters.
Canonic | Microsoft | JIS X 0208 Annex 5 | Code Point |
---|---|---|---|
U+203E | U+FFE3 | U+FFE3 | A1B1 |
U+2014 | U+2015 | | A1BD |
U+301C | U+FF5E | | A1C1 |
U+2016 | U+2225 | | A1C2 |
U+2212 | U+FF0D | | A1DD |
U+00A5 | U+FFE5 | U+FFE5 | A1EF |
U+00A2 | U+FFE0 | | A1F1 |
U+00A3 | U+FFE1 | | A1F2 |
U+00AC | U+FFE2 | | A2CC |
U+00A6 | U+FFE4 | | 8FA2C3 |
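One way to check an implementation against the table above is a small harness like the following sketch. `classifyDecoder` and `stubDecode` are hypothetical names; `decode` stands for any function that takes a Buffer of EUC-JP bytes and returns a string, and the rows are copied from the "Canonic" and "Microsoft" columns.

```javascript
// [EUC-JP bytes (hex), canonic code point, Microsoft code point]
const ROWS = [
  ['A1B1', 0x203E, 0xFFE3],
  ['A1BD', 0x2014, 0x2015],
  ['A1C1', 0x301C, 0xFF5E],
  ['A1C2', 0x2016, 0x2225],
  ['A1DD', 0x2212, 0xFF0D],
  ['A1EF', 0x00A5, 0xFFE5],
  ['A1F1', 0x00A2, 0xFFE0],
  ['A1F2', 0x00A3, 0xFFE1],
  ['A2CC', 0x00AC, 0xFFE2],
  ['8FA2C3', 0x00A6, 0xFFE4],
];

// Decode every row and report which column the result matches.
function classifyDecoder(decode) {
  return ROWS.map(([hex, canonic, microsoft]) => {
    const text = decode(Buffer.from(hex, 'hex'));
    const got = text.length > 0 ? text.codePointAt(0) : null;
    let kind = 'other';
    if (got === canonic) kind = 'canonic';
    else if (got === microsoft) kind = 'microsoft';
    return { hex, got, kind };
  });
}

// Stand-in decoder for demonstration only: it knows a single mapping.
const stubDecode = buf =>
  buf.toString('hex').toUpperCase() === 'A1C1' ? '\uFF5E' : '\uFFFD';

for (const { hex, got, kind } of classifyDecoder(stubDecode)) {
  const label = got === null
    ? '(none)'
    : 'U+' + got.toString(16).toUpperCase().padStart(4, '0');
  console.log(hex, label, kind);
}
```

With iconv-lite installed, passing `buf => lite.decode(buf, 'eucjp')` instead of the stub would classify its EUC-JP decoder; the reverse direction could be checked the same way with an encoder function.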
The following table shows non-injective mappings. That is, depending on the implementation, multiple code points in the legacy character set are mapped to a single Unicode character.
Canonic | JIS X 0212 | IBM/NEC ext. | Unicode |
---|---|---|---|
ADE2 | 8FA2F1 | 8FF4AC | U+2116 |
ADE4 | | 8FF4AD | U+2121 |
ADB5 | | 8FF3FD | U+2160 |
ADB6 | | 8FF3FE | U+2161 |
ADB7 | | 8FF4A1 | U+2162 |
ADB8 | | 8FF4A2 | U+2163 |
ADB9 | | 8FF4A3 | U+2164 |
ADBA | | 8FF4A4 | U+2165 |
ADBB | | 8FF4A5 | U+2166 |
ADBC | | 8FF4A6 | U+2167 |
ADBD | | 8FF4A7 | U+2168 |
ADBE | | 8FF4A8 | U+2169 |
A2E5 | | ADF5 | U+221A |
A2DC | | ADF7 | U+2220 |
A2C1 | | ADFB | U+2229 |
A2C0 | | ADFC | U+222A |
A2E9 | | ADF2 | U+222B |
A2E8 | | ADFA | U+2235 |
A2E2 | | ADF0 | U+2252 |
A2E1 | | ADF1 | U+2261 |
A2DD | | ADF6 | U+22A5 |
ADEA | | 8FF4AB | U+3231 |
- Note: ADxx, 8FF3xx and 8FF4xx are IBM/NEC extensions.
- The current index-jis0208.txt by WHATWG lacks mappings for 8FF3xx and 8FF4xx defined by eucjp-open.
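Non-injective entries like these can be found mechanically in a forward index. The sketch below is illustrative only: the helper names are hypothetical, the sample contains three groups from the table plus one single-source entry for contrast, and the first source listed for a character is assumed to be the "Canonic" one.

```javascript
// Forward (EUC-JP -> Unicode) index as [hex bytes, code point] pairs.
const forward = [
  ['ADE2', 0x2116], ['8FA2F1', 0x2116], ['8FF4AC', 0x2116],
  ['A2E5', 0x221A], ['ADF5', 0x221A],
  ['ADEA', 0x3231], ['8FF4AB', 0x3231],
  ['A1C1', 0x301C], // single source, included for contrast
];

// Group legacy code points by the Unicode character they decode to.
function groupByCodePoint(index) {
  const groups = new Map();
  for (const [hex, cp] of index) {
    if (!groups.has(cp)) groups.set(cp, []);
    groups.get(cp).push(hex);
  }
  return groups;
}

// Characters reachable from more than one legacy code point.
function nonInjective(index) {
  return [...groupByCodePoint(index)].filter(([, sources]) => sources.length > 1);
}

// Reverse (Unicode -> EUC-JP) table that keeps only the first source.
function buildReverse(index) {
  const reverse = new Map();
  for (const [cp, sources] of groupByCodePoint(index)) {
    reverse.set(cp, sources[0]);
  }
  return reverse;
}

console.log(nonInjective(forward).length);     // 3
console.log(buildReverse(forward).get(0x2116)); // ADE2
```

All listed sources still decode to the same character, but only the first one is produced when encoding, which matches the forward-only role suggested for the non-canonic columns.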