Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding characters present (often in ASN.1 comments)

Open rmwesley opened this issue 1 year ago • 3 comments

I am not asking for a fix. Just explaining an issue.

Here is the command I ran in bash:

$ ./tools/pycrate_asn1compile.py -i DSRC_instances_asn1_specs/EN15509/ -o DSRC_instances_asn1_specs/EN15509 -j
./tools/pycrate_asn1compile.py, args error: unable to read input file DSRC_instances_asn1_specs/EN15509/ISO14906Amd(2014)EfcDsrcGenericv5.asn
'utf-8' codec can't decode byte 0x93 in position 10503: invalid start byte

So the ASN.1 file contains bytes that are not valid UTF-8.

In one of the comments in the ISO14906Amd(2014)EfcDsrcGenericv5.asn file, the words UNIX time are surrounded by left (“ = 0x93 in Windows-1252, the Microsoft extension of Latin1) and right (” = 0x94) double quotation marks instead of plain ASCII quotation marks (" = 0x22), like so: “UNIX time”

These bytes are not valid UTF-8. The same kind of issue appears in EfcDsrcApplicationv5 and AVIAEINumberingAndDataStructures, with double quotation marks as well as other characters such as single quotation marks and dashes (’ = 0x92 and – = 0x96).
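To see exactly which bytes break the decoding, something like the following sketch could be used. It is not part of pycrate; the function name `find_non_utf8_bytes` is made up for illustration, and it simply retries strict UTF-8 decoding after each failure and reports the offset of every offending byte.

```python
def find_non_utf8_bytes(data: bytes):
    """Yield (offset, byte_value) for each byte that breaks strict UTF-8 decoding.

    Hypothetical helper, not part of pycrate. It re-decodes the remainder of
    the buffer after each failure, which is fine for spec-sized files.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # exc.start is relative to the slice
            yield bad, data[bad]
            pos = bad + 1  # skip the offending byte and keep scanning


if __name__ == "__main__":
    sample = b"-- \x93UNIX time\x94 --"
    for offset, value in find_non_utf8_bytes(sample):
        print("offset %d: 0x%02x" % (offset, value))
```

On a real file, the bytes would come from `open(path, "rb").read()`; the reported offsets can then be checked in a hex editor.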

As is often recommended, one should manually remove the comments from the ASN.1 specs. Instead, I will simply change these characters by hand to their ASCII equivalents, so that the files become valid UTF-8, and compile the ASN.1 specs from there. They are just comments after all...
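The manual replacement could also be scripted. The sketch below is only an illustration under a strong assumption: the file has already failed UTF-8 decoding and its stray bytes are Windows-1252 punctuation. Applied to a file that is genuinely UTF-8, it would corrupt multi-byte characters, since the same byte values occur there as continuation bytes. The name `asciify_cp1252_punctuation` is made up.

```python
# Windows-1252 punctuation bytes and the ASCII characters to substitute.
_PAIRS = [
    (0x91, b"'"),  # left single quotation mark
    (0x92, b"'"),  # right single quotation mark / apostrophe
    (0x93, b'"'),  # left double quotation mark
    (0x94, b'"'),  # right double quotation mark
    (0x96, b"-"),  # en dash
    (0x97, b"-"),  # em dash
]
_TABLE = bytes.maketrans(
    bytes(src for src, _ in _PAIRS),
    b"".join(dst for _, dst in _PAIRS),
)


def asciify_cp1252_punctuation(data: bytes) -> bytes:
    """Replace common Windows-1252 punctuation bytes with ASCII equivalents.

    Hypothetical helper: only meant for input that is NOT valid UTF-8.
    """
    return data.translate(_TABLE)


if __name__ == "__main__":
    fixed = asciify_cp1252_punctuation(b"-- \x93UNIX time\x94 --")
    print(fixed.decode("utf-8"))
```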

It is impossible to reliably detect 8-bit encodings programmatically, right? That only works if the encoding is kept as metadata or noted down somewhere. If the encoding could be determined, we could then simply do open("myfile", encoding=determined_encoding).
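Short of true detection, a common heuristic is to try strict UTF-8 first and fall back to a likely 8-bit encoding only when that fails. The sketch below assumes Windows-1252 is the most plausible fallback for these specs; that is a guess, not something the files declare. The function names are made up for illustration.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the first candidate that decodes cleanly.

    Heuristic only: a successful decode does not prove the guess was right.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail
    return data.decode("latin-1"), "latin-1"


def read_spec(path):
    """Hypothetical helper: read an ASN.1 file as bytes, then decode it."""
    with open(path, "rb") as fd:
        return decode_with_fallback(fd.read())
```

Trying UTF-8 first is the important part of the ordering: text in an 8-bit encoding rarely happens to be valid UTF-8, whereas cp1252 and latin-1 accept almost any byte sequence and so say very little when they succeed.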

Just to note, I downloaded the original ASN.1 specs directly from the official ISO site. Some of the specifications using ISO-8859-1 (Latin1) encoding are https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcGenericv5.asn, https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcApplicationv5.asn and https://standards.iso.org/iso/14816/ISO14816%20ASN.1%20repository/ISO14816_AVIAEINumberingAndDataStructures.asn.

rmwesley, Oct 04 '24 06:10

I agree that many ASN.1 specs provided here and there contain misencoded (or sometimes simply invalid) characters. This is generally the result of how the work is organized when building a technical standard or specification: different contributions from different companies and regions of the world are all merged into a big Word document, which is then eventually converted to PDF. This is error prone!

On the other hand, the current pycrate ASN.1 compiler tries to decode any input as UTF-8 and breaks if it contains a non-UTF-8 byte. What could be done is:

  • convert wrongly encoded but meaningful characters to their expected UTF-8 encoding.
  • drop invalid bytes when they just break the UTF-8 decoding.

This could lead to better acceptance of ASN.1 specs at the end.
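As a rough sketch of those two behaviours combined (not pycrate's actual implementation), a custom codec error handler can reinterpret each undecodable byte as Windows-1252 when that yields a character, and drop it otherwise. The handler name `cp1252_or_drop` and the choice of Windows-1252 are assumptions made for illustration.

```python
import codecs


def _cp1252_or_drop(exc):
    """Error handler: map undecodable bytes via cp1252, or drop them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    try:
        replacement = bad.decode("cp1252")  # meaningful character: keep it
    except UnicodeDecodeError:
        replacement = ""  # byte undefined in cp1252 as well: drop it
    return replacement, exc.end  # resume decoding after the bad bytes


# "cp1252_or_drop" is a made-up handler name for this sketch.
codecs.register_error("cp1252_or_drop", _cp1252_or_drop)


def lenient_utf8_decode(data: bytes) -> str:
    """Decode as UTF-8, repairing or dropping bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="cp1252_or_drop")


if __name__ == "__main__":
    print(lenient_utf8_decode(b"-- \x93UNIX time\x94 --"))
```

Unlike a whole-file fallback, this keeps any valid UTF-8 already in the file intact and only touches the bytes that fail.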

mitshell, Oct 08 '24 20:10

For my own recollection, this is happening here: https://github.com/pycrate-org/pycrate/blob/0b8309e258c59de9d0d7e6da31566097c92f224b/tools/pycrate_asn1compile.py#L200

mitshell, Oct 08 '24 20:10

* convert wrongly encoded but meaningful characters to their expected UTF8 encoding.

There are inherent limitations that make it impossible to always correctly convert a wrongly encoded sequence, but there's a well-crafted Python library for fixing mojibake that works pretty well here: https://ftfy.readthedocs.io/en/latest/detect.html

Here's what this looks like minimally added to pycrate. I was able to decode and successfully compile @rmwesley's example specifications.

diff --git a/tools/pycrate_asn1compile.py b/tools/pycrate_asn1compile.py
index d12b641..9c99e2e 100755
--- a/tools/pycrate_asn1compile.py
+++ b/tools/pycrate_asn1compile.py
@@ -33,6 +33,7 @@ import os
 import sys
 import argparse
 import inspect
+import ftfy
 
 from pycrate_asn1c.generator import _Generator
 from pycrate_asn1c.asnproc import (
@@ -97,6 +98,8 @@ def main():
                         help='provide an alternative python generator file path')
     parser.add_argument('-j', dest='json', action='store_true',
                         help='output a json file with information on ASN.1 objects dependency')
+    parser.add_argument('-G', action='store_true',
+                        help='guess the file encoding with python-ftfy')
     parser.add_argument('-fautotags', action='store_true',
                         help='force AUTOMATIC TAGS for all ASN.1 modules')
     parser.add_argument('-fextimpl', action='store_true',
@@ -196,15 +199,20 @@ def main():
         # read all file content into a single buffer
         txt = []
         for f in files:
+            open_file = lambda f: open(f, 'r', encoding='utf-8')
+            read_file = lambda fd: fd.read()
+            if args.G:  # guess the encoding with ftfy
+                open_file = lambda f: open(f, 'rb')
+                read_file = lambda fd: ftfy.guess_bytes(fd.read())[0]
             try:
-                fd = open(f, 'r', encoding='utf-8')
+                fd = open_file(f)
             except Exception as e:
                 print('%s, args error: unable to open input file %s' % (sys.argv[0], f))
                 print(e)
                 return 1
             else:
                 try:
-                    txt.append( fd.read() )
+                    txt.append( read_file(fd) )
                 except Exception as e:
                     print('%s, args error: unable to read input file %s' % (sys.argv[0], f))
                     print(e)

I wouldn't want to include it exactly as is without better documentation around the feature, for reasons summed up in the documentation:

Unlike the rest of ftfy, this may not be accurate, and it may create Unicode problems instead of solving them!

But it may be useful to people with poorly encoded ASN.1 files. Using ftfy on the command line is very simple: https://ftfy.readthedocs.io/en/latest/cli.html

$ ftfy -g bad.asn > good.asn

oopsbagel, Apr 10 '25 03:04