Encoding helpers
I kinda struggle with EDIFACT encoding, but here is what I came up with:
data:

```python
# https://blog.sandro-pereira.com/2009/08/15/edifact-encoding-edi-character-set-support/
# https://www.truugo.com/edifact/d09a/cl0001/
# A bit unsure of how 10646-1 maps exactly to utf-8
EDIFACT_ENCODINGS = {
    "UNOA": "ascii",  # iso-646
    "UNOB": "ascii",  # iso-646
    "UNOC": "iso-8859-1",
    "UNOD": "iso-8859-2",
    "UNOE": "iso-8859-5",
    "UNOF": "iso-8859-7",
    "UNOG": "iso-8859-3",
    "UNOH": "iso-8859-4",
    "UNOI": "iso-8859-6",
    "UNOJ": "iso-8859-8",
    "UNOK": "iso-8859-9",
    "UNOW": "utf-8",  # 10646-1
    "UNOX": "iso-2022-jp",  # iso 2022 / 2375
    "UNOY": "utf-8",  # 10646-1
}
```
deserializing helper:

```python
from pydifact.parser import Parser


def guess_edifact_encoding(stream):
    """Read up to the UNB segment of a byte stream and map its syntax
    identifier (UNOA, UNOB, …) to a Python codec name."""
    unb_line = b"\n"
    eof_marker = b""
    while not unb_line.startswith(b"UNB") and unb_line != eof_marker:
        unb_line = stream.readline()
    if not unb_line.startswith(b"UNB"):
        raise ParseError("Missing UNB segment")
    # The UNB segment itself must be ASCII-only
    unb_line_s = unb_line.decode("ascii")
    parser = Parser()
    unb_segment = list(parser.parse(unb_line_s))[0]
    encoding_element = unb_segment.elements[0][0]
    try:
        # Ignore the syntax version number, always v1…
        return EDIFACT_ENCODINGS[encoding_element]
    except KeyError:
        raise ParseError(f"Unknown syntax identifier: {encoding_element}")
```
I wonder what pydifact could embed in its scope in terms of:
- the encoding table helper (data)
- a serialization helper (e.g. an `Interchange.serialize_to_bytes()` helper with automatic encoding selection based on the syntax identifier?)
- deserialization from bytes, handling decoding with a guesser like the one I wrote
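Purely as a sketch of what the serialization side could look like (the `serialize_to_bytes` name is hypothetical, not existing pydifact API, and the encoding table is abbreviated here):

```python
# Hypothetical helper sketch; nothing here is existing pydifact API.
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOB": "ascii", "UNOC": "iso-8859-1"}


def serialize_to_bytes(serialized: str, syntax_identifier: str) -> bytes:
    """Encode an already-serialized interchange string with the charset
    implied by its syntax identifier (falling back to ascii)."""
    return serialized.encode(EDIFACT_ENCODINGS.get(syntax_identifier, "ascii"))


print(serialize_to_bytes("UNB+UNOC:3+é'", "UNOC"))  # b"UNB+UNOC:3+\xe9'"
```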
Any thoughts appreciated :-).
Hm. I am just dealing with 8859-1 encoding in my files. But yes, the specification for all of them is in the header. One thing I don't understand, and frankly ignore most of the time, as I'd like to see everything in UTF-8: other encodings are stupid and don't exist... ...and the earth is flat. OMG.
ok. AFAIU, it would suffice to read the 4-byte syntax identifier of the interchange, decide which encoding to take, and then read the rest of the stream in that encoding.
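A rough sketch of that peek-then-decode step (assuming the syntax identifier sits right after `UNB+`, with an optional 9-byte UNA segment before it; the function name is made up for illustration):

```python
import io


def peek_syntax_identifier(stream):
    """Read the head of a byte stream and extract the 4-character syntax
    identifier (e.g. 'UNOA'), skipping an optional UNA segment."""
    head = stream.read(16)
    if head.startswith(b"UNA"):
        # UNA is fixed-length: 'UNA' + 6 service characters = 9 bytes
        head = head[9:] + stream.read(9)
    if not head.startswith(b"UNB+"):
        raise ValueError("No UNB segment at start of interchange")
    # The identifier itself is always plain ASCII
    return head[4:8].decode("ascii")


print(peek_syntax_identifier(io.BytesIO(b"UNB+UNOC:3+sender+recipient'")))  # UNOC
```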
I have files starting with "UNB+ANSI:1+ME123456" - mostly without a UNA header, and none of those UNO[x] specifiers. An example is in the test data files. How to deal with that?
Seems legit to make an assumption (like utf-8? Unless the standard has another, incompatible default?) and to offer a way to force a decoding charset. The forcing option would also be marginally useful for dealing with messages where the declared UNOA charset doesn't match the actual encoding.
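Something like the following could express both the fallback default and the forcing option (function and parameter names are made up for illustration; encoding table abbreviated):

```python
EDIFACT_ENCODINGS = {"UNOA": "ascii", "UNOB": "ascii", "UNOC": "iso-8859-1"}


def decode_interchange(raw: bytes, forced_encoding=None, default="utf-8") -> str:
    """Decode interchange bytes: an explicitly forced charset always wins;
    otherwise use the UNO[x] identifier after 'UNB+', falling back to
    `default` for files (like the UNB+ANSI ones) with no recognizable
    identifier."""
    if forced_encoding:
        return raw.decode(forced_encoding)
    ident = raw[4:8].decode("ascii", errors="replace")
    return raw.decode(EDIFACT_ENCODINGS.get(ident, default))


print(decode_interchange(b"UNB+UNOC:3+\xe9'"))      # UNB+UNOC:3+é'
print(decode_interchange(b"UNB+ANSI:1+ME123456'"))  # falls back to utf-8
```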
I suspect that example is ANSI/X12 and not EDIFACT. So is it in the scope of pydifact anyway?
It is definitely part of what I want to cope with, because I need to deal with that kind of file... But I'm afraid this is EDIFACT. It's a file I got myself (I just changed the names to pseudonymize them) - but here in medical systems, many companies don't care about standards...