ISO 2022 JIS Japanese encoding fails

Open n1474335 opened this issue 6 years ago • 2 comments

Hi, thanks very much for your work on this repository, it's incredibly useful. We use it as the main character encoding library for CyberChef.

We've recently noticed an issue when trying to encode into ISO 2022 JIS Japanese where only null bytes are returned.

The affected CP numbers are 50220, 50221 and 50222.

Example code

import cptable from "codepage";

cptable.utils.encode(50220, "こんにちは");

Expected output

Uint8Array(10) [164, 179, 164, 243, 164, 203, 164, 193, 164, 207]

Actual output

Uint8Array(5) [0, 0, 0, 0, 0]

Can you shed any light on this behaviour?

n1474335 avatar Nov 01 '19 16:11 n1474335

Another example that also fails:

Code

import cptable from "codepage";

cptable.utils.encode(50220, "ーム")

Expected output

Uint8Array(10) [27, 36, 66, 33, 60, 37, 96, 27, 40, 66]

Actual output

Uint8Array(2) [0, 0]

n1474335 avatar Nov 01 '19 16:11 n1474335

Thanks for sharing! The ISO 2022 codepages 5022{0,1,2,5,7} are definitely incorrect -- hiragana require a control sequence and those are not currently supported. Based on ECMA-35, the first kana "こ" should be encoded as 1B 24 42 24 33 (1B 24 42 to switch to the JIS double-byte encoding, 24 for the Hiragana subset and 33 for the actual character). This will require a direct implementation of control sequences and a new set of LUTs for the various character subsets.
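To make the byte layout concrete, here is a minimal sketch of the escape-sequence handling for kana only. It is not js-codepage code: `kanaToJis` and `encodeKanaIso2022` are hypothetical names, and the arithmetic mapping assumes the JIS X 0208 layout in which hiragana occupy row 0x24 and katakana row 0x25. A real implementation would need full lookup tables for kanji and the other character sets.

```javascript
// Hypothetical helpers for illustration; not part of the js-codepage API.
const ESC_JIS = [0x1b, 0x24, 0x42];   // ESC $ B : switch to JIS X 0208 (double-byte)
const ESC_ASCII = [0x1b, 0x28, 0x42]; // ESC ( B : switch back to ASCII

// Map a kana code point to its two-byte JIS X 0208 code, or null if unsupported.
function kanaToJis(cp) {
  if (cp >= 0x3041 && cp <= 0x3093) return [0x24, 0x21 + (cp - 0x3041)]; // hiragana
  if (cp >= 0x30a1 && cp <= 0x30f6) return [0x25, 0x21 + (cp - 0x30a1)]; // katakana
  if (cp === 0x30fc) return [0x21, 0x3c]; // prolonged sound mark "ー"
  return null;
}

// Encode ASCII and kana, emitting an escape sequence whenever the mode changes.
function encodeKanaIso2022(str) {
  const out = [];
  let inJis = false;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      if (inJis) { out.push(...ESC_ASCII); inJis = false; }
      out.push(cp);
      continue;
    }
    const pair = kanaToJis(cp);
    if (pair === null) {
      throw new Error("unsupported character U+" + cp.toString(16).toUpperCase());
    }
    if (!inJis) { out.push(...ESC_JIS); inJis = true; }
    out.push(...pair);
  }
  if (inJis) out.push(...ESC_ASCII); // always finish in ASCII mode
  return Uint8Array.from(out);
}

console.log(encodeKanaIso2022("こ"));   // bytes 1B 24 42 24 33 1B 28 42
console.log(encodeKanaIso2022("ーム")); // [27, 36, 66, 33, 60, 37, 96, 27, 40, 66]
```

Under these assumptions the second call reproduces the expected output from the "ーム" report above, and the first yields the 1B 24 42 24 33 prefix followed by the switch back to ASCII.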

PS: All of the generated codepages with source listed as "Windows 7" are assumed to either be single-byte or double-byte. Clearly that wasn't the case here.

SheetJSDev avatar Nov 01 '19 17:11 SheetJSDev