iconv-lite

add some EBCDIC encodings

Open Mithgol opened this issue 8 years ago • 10 comments

Fixes #111 partially.

EBCDIC 037 mapping has been taken from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript.
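
A minimal sketch of the kind of automatic conversion described here, assuming the mapping file uses tab-separated `0xXX 0xXXXX #comment` lines. The helper names `parseMapping` and `toEscape` and the three sample lines are illustrative only; they are not part of iconv-lite and are not copied from CP037.TXT.

```typescript
// Illustrative sample in the assumed mapping-file layout:
// EBCDIC byte, Unicode code point, comment (tab-separated).
const sampleMapping = [
    "# sample excerpt, not the real file",
    "0x15\t0x0085\t#CONTROL",
    "0x25\t0x000A\t#LINE FEED",
    "0xC1\t0x0041\t#LATIN CAPITAL LETTER A",
].join("\n");

// Hypothetical helper: builds a 256-entry table indexed by EBCDIC byte.
// Bytes that the mapping does not define stay as U+FFFD.
function parseMapping(text: string): string[] {
    const table: string[] = [];
    for (let i = 0; i < 256; i++) table.push("\uFFFD");
    for (const line of text.split(/\r?\n/)) {
        const match = /^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4})/.exec(line);
        if (match === null) continue; // comment or blank line
        table[parseInt(match[1], 16)] = String.fromCharCode(parseInt(match[2], 16));
    }
    return table;
}

// Hypothetical helper: renders one character as a \uXXXX escape
// suitable for pasting into a JavaScript source file.
function toEscape(ch: string): string {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
}

const table = parseMapping(sampleMapping);
console.log(toEscape(table[0xc1])); // \u0041
console.log(toEscape(table[0x15])); // \u0085
```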

EBCDIC 1140 is said to be different only at code point 9F (I have manually retyped that difference).

Note: this pull request does not contain tests because I am not sure what they should look like.

Mithgol avatar Nov 06 '15 08:11 Mithgol

EBCDIC 500 mapping has been taken from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP500.TXT and automatically converted from 0xXXXX to \uXXXX format for JavaScript.

EBCDIC 1148 is said to be different only at code point 9F (I have manually retyped that difference).
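
Rather than retyping, a single-position difference like this could also be applied programmatically. The sketch below assumes that the base tables hold U+00A4 (currency sign) at 0x9F and that the Euro variants hold U+20AC there; the thread only says the tables differ at 9F, so treat both values as assumptions to check against the mapping files. `deriveTable` is a hypothetical helper, and the base table is a stand-in filled with placeholders.

```typescript
// Hypothetical helper: returns a copy of a 256-character table with
// selected byte positions replaced. The base table is left untouched.
function deriveTable(base: string, overrides: { [byte: number]: string }): string {
    const chars = base.split("");
    for (const key of Object.keys(overrides)) {
        const byte = Number(key);
        chars[byte] = overrides[byte];
    }
    return chars.join("");
}

// Stand-in base table: placeholders everywhere except 0x9F.
// A real table would come from the full mapping file.
let baseTable = "";
for (let i = 0; i < 256; i++) {
    baseTable += i === 0x9f ? "\u00A4" : "\uFFFD";
}

// Assumed difference: the Euro sign replaces the currency sign at 0x9F.
const euroTable = deriveTable(baseTable, { 0x9f: "\u20AC" });
console.log(euroTable.charAt(0x9f) === "\u20AC"); // true
```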

Mithgol avatar Nov 06 '15 09:11 Mithgol

There are some problems here with some of the control mappings. They arise because EBCDIC has separate Carriage Return, New Line, and Line Feed characters. Control characters in EBCDIC that do not translate have been given arbitrary Unicode values starting at U+0080. This includes the NL character (0x20 in EBCDIC), which is assigned U+0080. On the systems I've touched, the EBCDIC NL character is used in place of the LF character for marking EOL.

devin122 avatar Apr 13 '17 16:04 devin122

Currently Wikipedia says that EBCDIC NL is 0x15 in EBCDIC 500 (and in its variation EBCDIC 1148) and in EBCDIC 037 (and in its variation EBCDIC 1140).

These four are mapped (by the Microsoft mappings, mentioned above) to U+0085 (officially said to be “NEXT LINE” or “NEL”) which seems correct to me.

Mithgol avatar Apr 16 '17 18:04 Mithgol

I'm not really sure how many programs handle U+0085 properly. The other issue is the opposite direction: since LF is the usual line terminator, it gets converted to EBCDIC LF (0x25). I need to double-check, but the EBCDIC machines I've had access to do not like this at all. They want NL line endings.
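
To make the concern concrete, here is a small sketch using a partial decode table with only the byte values mentioned in this thread (NL 0x15 to U+0085, LF 0x25 to U+000A) plus CR and two letters. The table and the `decode` function are illustrative stand-ins, not iconv-lite code.

```typescript
// Partial, illustrative decode table: only the bytes used below.
const decodeTable: { [byte: number]: string } = {
    0x0d: "\r",     // EBCDIC CR
    0x15: "\u0085", // EBCDIC NL -> U+0085 (NEL), as discussed above
    0x25: "\n",     // EBCDIC LF -> U+000A
    0xc1: "A",
    0xc2: "B",
};

// Hypothetical single-byte decoder over the partial table.
function decode(bytes: Uint8Array): string {
    let out = "";
    for (let i = 0; i < bytes.length; i++) {
        const ch = decodeTable[bytes[i]];
        out += ch !== undefined ? ch : "\uFFFD";
    }
    return out;
}

// "A", NL, "B": two lines as an NL-terminated source might store them.
const decoded = decode(new Uint8Array([0xc1, 0x15, 0xc2]));

// Code that splits on LF sees a single line...
console.log(decoded.split("\n").length); // 1
// ...while splitting on U+0085 recovers both.
console.log(decoded.split("\u0085").length); // 2
```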

devin122 avatar Apr 16 '17 21:04 devin122

As I also have to add support for various EBCDIC encodings, I did a little research on this matter and found an implementation by IBM, which they open sourced.

Here ConversionMaps is used to map between encodings and code pages (or more formally CCSIDs). In ConvTable this mapping is then used to load the respective converter (e.g. ConvTable1140 to map between Unicode and EBCDIC; CCSID 1140 is the Euro update of CCSID 037 according to "Code pages with Latin-1 character sets" in the Wikipedia entry). Skimming through their codebase, a good number of such mappings are available, which might be helpful in adding support for those encodings to iconv-lite.

Using a somewhat more complex EBCDIC sample taken from this page, I was able, after some back-and-forth conversions and modifying my local sbcs-data.js file, to validate the ebcdic.txt sample file against the ascii.txt file with a test like this:

    import * as assert from "assert";
    import * as fs from "fs";
    import * as iconv from "iconv-lite";

    it("Read EBCDIC from stream", (done) => {
        // Strip all line terminators from the expected text.
        const expected: string = fs.readFileSync("./test/ascii.txt", "latin1").replace(/[\r\n]/g, "");

        // https://querysurge.zendesk.com/hc/en-us/articles/215029906-QuerySurge-and-Mainframe-Data-EBCDIC-Files
        // the EBCDIC file is UTF-8 encoded, so we'll need to specify this in the call. For the output
        // ASCII file, we'll use the ISO-8859-1 encoding. The record length for the sample file is 67
        // bytes
        const stream =
            fs.createReadStream("./test/ebcdic.txt")
                .pipe(iconv.decodeStream("utf8"))
                .pipe(iconv.encodeStream("iso88591"))
                .pipe(iconv.decodeStream("ebcdic037"))
                // .pipe(iconv.decodeStream("ebcdic1140"))
                // .pipe(iconv.decodeStream("ebcdic500"))
                // .pipe(iconv.decodeStream("ebcdic1148"))
        ;

        // The last decodeStream emits strings. Join them explicitly: calling
        // toString() on the array would insert commas between chunks.
        const chunks: string[] = [];
        stream.on("data", (chunk: string) => chunks.push(chunk));
        stream.on("error", done);
        stream.on("end", () => {
            assert.strictEqual(chunks.join(""), expected);
            done(); // signal completion so the runner waits for the stream
        });
    });

This sample test works with CCSIDs 037, 277, 280, 284, 285, 297, 500, and 1047, but fails for e.g. 273.

BTW, one can check EBCDIC files in IntelliJ quite easily just by changing the file encoding from the default UTF-8 to e.g. IBM01140 or a similar one. Unfortunately, I need such support in Visual Studio Code, which seems to rely on jschardet and iconv-lite to probe and convert between encodings.

HTH

RovoMe avatar Jun 30 '20 15:06 RovoMe

Thanks for the research @RovoMe! Any specific action items you would like to add here, or is it mostly additional info?

I always try to generate the encodings directly from authoritative sources, e.g. in https://github.com/ashtuchkin/iconv-lite/blob/master/generation/gen-dbcs.js we download the corresponding tables from unicode.org or encoding.spec.whatwg.org.

To support EBCDIC, ideally I'd want something like gen-ebcdic.js that downloads the tables from unicode.org and transforms them to iconv-lite format. Java sources do not work well for that purpose, unfortunately.

Also I think the NL concern by @devin122 is valid (see https://en.wikipedia.org/wiki/Newline#Representation). We might want to address it by 1) encoding/decoding without changes by default, this would keep 1:1 representation of all latin1 characters, but then 2) add a codec option like EBCDICNLConversion: '\n', which would enable conversion of NL char to corresponding char(s). This conversion can probably be a separate PR.
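
A rough sketch of how such an option might behave, written as plain string post-processing around an unchanged 1:1 codec. Everything here is hypothetical: the option name `nlConversion` only mirrors the `EBCDICNLConversion` idea above, and neither function exists in iconv-lite.

```typescript
const NEL = "\u0085"; // what EBCDIC NL decodes to under the 1:1 mapping

// Hypothetical option shape modeled on the proposal above.
interface NlOptions {
    nlConversion?: string; // e.g. "\n" or "\r\n"; unset keeps the 1:1 mapping
}

// Applied to text after decoding from EBCDIC.
function afterDecode(text: string, options: NlOptions = {}): string {
    if (!options.nlConversion) return text; // default: no changes
    return text.split(NEL).join(options.nlConversion);
}

// Applied to text before encoding to EBCDIC, so the configured
// line terminator ends up as an NL byte rather than an LF byte.
function beforeEncode(text: string, options: NlOptions = {}): string {
    if (!options.nlConversion) return text; // default: no changes
    return text.split(options.nlConversion).join(NEL);
}

console.log(afterDecode("A\u0085B", { nlConversion: "\n" }) === "A\nB"); // true
console.log(beforeEncode("A\nB", { nlConversion: "\n" }) === "A\u0085B"); // true
```

Keeping the conversion outside the table lookup leaves the default behavior lossless, which matches point 1) above; a real implementation inside a streaming codec would also need to handle a multi-character terminator split across chunks.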

Finally, FYI, we do work on integrating iconv-lite into VS Code, but it hasn't happened yet.

ashtuchkin avatar Jul 01 '20 19:07 ashtuchkin

I would like this please. Thanks!

Fish1 avatar Aug 11 '21 19:08 Fish1

Agreed, it would be very helpful to have the capability of opening encodings like CP037.

Dman247 avatar Aug 11 '21 19:08 Dman247

Is there any chance this PR is going forward?

GitMensch avatar Aug 23 '22 16:08 GitMensch

vscode depends on this issue - https://github.com/microsoft/vscode/issues/49891 is "the big and old" one, duplicates are at least https://github.com/microsoft/vscode/issues/147064 https://github.com/microsoft/vscode/issues/179693.

@ashtuchkin Can you take a look at integrating this and publish a new version?

GitMensch avatar Nov 06 '23 17:11 GitMensch