
Encoding helpers

JocelynDelalande opened this issue • 3 comments

I've been struggling a bit with EDIFACT encoding, but here is what I came up with:

data:


# https://blog.sandro-pereira.com/2009/08/15/edifact-encoding-edi-character-set-support/
# https://www.truugo.com/edifact/d09a/cl0001/
# A bit unsure of how 10646-1 maps exactly to utf-8
EDIFACT_ENCODINGS = {
    "UNOA": "ascii",  # ISO 646
    "UNOB": "ascii",  # ISO 646
    "UNOC": "iso-8859-1",
    "UNOD": "iso-8859-2",
    "UNOE": "iso-8859-5",
    "UNOF": "iso-8859-7",
    "UNOG": "iso-8859-3",
    "UNOH": "iso-8859-4",
    "UNOI": "iso-8859-6",
    "UNOJ": "iso-8859-8",
    "UNOK": "iso-8859-9",
    "UNOW": "utf-8",  # ISO 10646-1
    "UNOX": "iso-2022-jp",  # ISO 2022 / ISO 2375
    "UNOY": "utf-8",  # ISO 10646-1
}
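A quick way to see why the table matters (a minimal, self-contained sketch with an abridged copy of the table): a character like "É" round-trips fine through the iso-8859-1 codec that UNOC maps to, but cannot be encoded at all under UNOA/UNOB (ascii).

```python
# Sanity check of the mapping (abridged copy of the table above):
# "É" is representable in iso-8859-1 (UNOC) but not in ascii (UNOA/UNOB).
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOC": "iso-8859-1"}

text = "NAD+BY+ÉPICERIE DUPONT'"
print(text.encode(EDIFACT_ENCODINGS["UNOC"]))  # encodes fine
try:
    text.encode(EDIFACT_ENCODINGS["UNOA"])
except UnicodeEncodeError:
    print("not representable in UNOA (ascii)")
```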

deserializing helper:

from pydifact.parser import Parser
# ParseError is assumed to be defined elsewhere (e.g. your own exception class)


def guess_edifact_encoding(stream):
    """Read a binary stream up to its UNB segment and return the Python
    codec name matching the syntax identifier found there."""
    unb_line = b"\n"
    eof_marker = b""
    while not unb_line.startswith(b"UNB") and unb_line != eof_marker:
        unb_line = stream.readline()

    if not unb_line.startswith(b"UNB"):
        raise ParseError("Missing UNB segment")

    # The UNB segment itself must be ASCII-only
    unb_line_s = unb_line.decode("ascii")
    parser = Parser()
    unb_segment = next(iter(parser.parse(unb_line_s)))
    # Syntax identifier, e.g. "UNOC"; ignore the version, always v1…
    encoding_element = unb_segment.elements[0][0]
    try:
        return EDIFACT_ENCODINGS[encoding_element]
    except KeyError:
        raise ParseError(f"Unknown encoding spec: {encoding_element}")
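One design detail worth noting: the guesser consumes the stream up to the UNB line, so a caller has to rewind before decoding the full interchange. A self-contained usage sketch (it inlines a minimal stand-in for the guesser, assuming default separators "+" and ":" and newline-separated segments; real code would call guess_edifact_encoding() instead):

```python
import io

# Abridged copy of the encoding table above.
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOC": "iso-8859-1", "UNOY": "utf-8"}

def minimal_guess(stream):
    # Minimal stand-in for guess_edifact_encoding(): find the UNB line
    # and split out the syntax identifier (assumes default separators).
    line = b"\n"
    while not line.startswith(b"UNB") and line != b"":
        line = stream.readline()
    ident = line.decode("ascii").split("+")[1].split(":")[0]
    return EDIFACT_ENCODINGS[ident]

def decode_interchange(raw: bytes) -> str:
    stream = io.BytesIO(raw)
    encoding = minimal_guess(stream)
    stream.seek(0)  # the guesser consumed lines up to UNB: rewind first
    return stream.read().decode(encoding)

raw = "UNB+UNOC:3+SENDER+DEST'\nNAD+BY+ÉPICERIE'\n".encode("iso-8859-1")
print(decode_interchange(raw))
```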

I wonder what pydifact could embed in its scope in terms of:

  • an encoding table (the data above)
  • a serialization helper (e.g. an Interchange.serialize_to_bytes() method with automatic encoding selection based on the syntax identifier?)
  • deserialization from bytes, handling decoding with a guesser like the one above
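The serialization side could be sketched roughly like this. To be clear, this is a hypothetical API: neither serialize_to_bytes() nor its syntax_identifier argument exist in pydifact today, and the table is abridged.

```python
# Abridged copy of the encoding table above.
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOC": "iso-8859-1", "UNOY": "utf-8"}

def serialize_to_bytes(serialized: str, syntax_identifier: str) -> bytes:
    """Encode an already-serialized interchange with the codec implied
    by its syntax identifier (e.g. "UNOC" -> iso-8859-1).
    Hypothetical helper, not part of pydifact's API."""
    try:
        encoding = EDIFACT_ENCODINGS[syntax_identifier]
    except KeyError:
        raise ValueError(f"Unknown syntax identifier: {syntax_identifier}")
    return serialized.encode(encoding)

print(serialize_to_bytes("UNB+UNOC:3+ÉMETTEUR+DEST'", "UNOC"))
```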

Any thoughts appreciated :-).

JocelynDelalande · Dec 23 '22 11:12

Hm. I am just dealing with ISO 8859-1 encoding in my files. But yes, there is a specification for all of them in the header. One thing I don't understand - and frankly I've been ignoring it most of the time, as I'd like to see everything in UTF-8; other encodings are stupid and don't exist... ...and the earth is flat. OMG.

OK. AFAIU, it would suffice to read the 4-character syntax identifier from the interchange header, decide which encoding to use, and then read the rest of the stream in that encoding.

I have files starting with "UNB+ANSI:1+ME123456" - mostly without a UNA header, and none of those UNO[x] specifiers. An example is in the test data files. How to deal with that?

nerdoc · Dec 29 '22 21:12

> OK. AFAIU, it would suffice to read the 4-character syntax identifier from the interchange header, decide which encoding to use, and then read the rest of the stream in that encoding.
>
> I have files starting with "UNB+ANSI:1+ME123456" - mostly without a UNA header, and none of those UNO[x] specifiers. An example is in the test data files. How to deal with that?

Seems legitimate to make an assumption (utf-8? unless the standard has another, incompatible default?) and to offer a way to force a decoding charset. The forcing option would also occasionally be useful for dealing with messages where the declared UNOA doesn't match the actual encoding.
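The "assume a default, allow forcing" idea could look like this (a hypothetical helper, not pydifact API; the utf-8 default is an assumption, not something the standard mandates, and the table is abridged):

```python
# Abridged copy of the encoding table above.
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOC": "iso-8859-1", "UNOY": "utf-8"}

def pick_encoding(syntax_identifier=None, forced=None, default="utf-8"):
    """A forced charset wins; otherwise use the declared UNO[x]
    identifier; otherwise fall back to the default."""
    if forced:
        return forced
    if syntax_identifier in EDIFACT_ENCODINGS:
        return EDIFACT_ENCODINGS[syntax_identifier]
    return default

print(pick_encoding("UNOC"))                        # declared identifier wins
print(pick_encoding("UNOC", forced="iso-8859-15"))  # explicit override
print(pick_encoding("ANSI"))                        # unknown spec -> default
```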

I suspect that your example is ANSI X12 and not EDIFACT. So is it in the scope of pydifact anyway?

JocelynDelalande · Dec 30 '22 07:12

It is definitely part of what I want to cope with, because I need to deal with that kind of file... But I'm afraid this is EDIFACT. It's a file I got myself (I just changed the names to pseudonymize them) - but here in medical systems, many companies don't care about standards...

nerdoc · May 07 '23 18:05