XmlToJson Can't parse XMLs containing BOM characters

Hello. Thank you for this library; I've been using it in some of my projects and I find it very useful! Today I found out that when I try to parse XML strings fetched from an API that start with either \uFEFF or \uFFFF, the JSON returned is empty.

Is there a way to parse such XML strings without the obvious workaround of calling xmlString.replace(Regex("[\uFEFF-\uFFFF]"), "") and passing the result to XmlToJson? I think this should be handled inside the library, because these characters are not visible in the strings, which makes the problem very difficult to debug. I spent some hours trying to pinpoint why my input was not being parsed correctly. Thank you!
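
For readers who need the workaround in Java, a minimal sketch follows. It assumes the XML has already been decoded into a String, so the BOM shows up as the single character U+FEFF at index 0. The class and method names (BomStrip, stripLeadingBom) are hypothetical and not part of XmlToJson. Unlike the regex above, it removes only a leading U+FEFF; the range U+FEFF to U+FFFF also covers the halfwidth and fullwidth forms block, so the regex would strip those characters from the content as well.

```java
final class BomStrip {
    // U+FEFF is how a byte order mark appears once the bytes have been decoded to a String.
    private static final char BOM = '\uFEFF';

    /** Returns the input without a leading BOM character, or the input unchanged if there is none. */
    static String stripLeadingBom(String xml) {
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == BOM) {
            return xml.substring(1);
        }
        return xml;
    }

    public static void main(String[] args) {
        String raw = "\uFEFF<root><a>1</a></root>";
        // The cleaned string is what would then be handed to the XML parser.
        System.out.println(stripLeadingBom(raw));
    }
}
```

The result of stripLeadingBom(xmlString) can then be used as the parser input in place of the original string.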

Oct 23 '19 10:10 lpbas

You may add a removeBom() method.

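    // Requires: import java.util.Arrays;
    // Strips a leading byte order mark, if present, from the raw bytes.
    private static byte[] removeBom(byte[] bytes) {
        // UTF-8 BOM: EF BB BF (signed bytes -17, -69, -65)
        if ((bytes.length >= 3) && (bytes[0] == -17) && (bytes[1] == -69) && (bytes[2] == -65)) {
            return Arrays.copyOfRange(bytes, 3, bytes.length);
        }
        // UTF-16 little-endian BOM: FF FE (signed bytes -1, -2)
        if ((bytes.length >= 2) && (bytes[0] == -1) && (bytes[1] == -2)) {
            return Arrays.copyOfRange(bytes, 2, bytes.length);
        }
        // UTF-16 big-endian BOM: FE FF (signed bytes -2, -1)
        if ((bytes.length >= 2) && (bytes[0] == -2) && (bytes[1] == -1)) {
            return Arrays.copyOfRange(bytes, 2, bytes.length);
        }
        // No recognized BOM: return the input unchanged.
        return bytes;
    }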

Jan 06 '20 15:01 javadev

hello Lucas @L4grange and Valentyn @javadev !

thanks for using XmlToJson :)

The BOM characters are a little obscure to me, and this seems to be a very specific problem. The proposed solution certainly works in this case, but it is not very generic.

I was wondering if there could be a wider solution for special characters like bom, or if this should be fixed outside of the library (before using it).

what do you think?

Jan 06 '20 20:01 smart-fun

Thank you for your reply @smart-fun ! Before facing this problem I had never worked with BOM characters. However, since they are not visible and not easy to catch, and they do exist in some XMLs with deprecated formatting, I think it would be good for this library to handle XMLs with or without BOM characters.

As for a more generalised solution, I agree, but I haven't found any other whitespace characters that were breaking the parsing. If you have an example of other characters, we could look for such a solution.

Jan 08 '20 08:01 lpbas

I found BOM bytes for XML with UTF-16 and UTF-32 encodings.
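
To illustrate how the UTF-16 and UTF-32 marks could be handled alongside UTF-8, here is a minimal sketch. It is not XmlToJson code; the names BomDecoder, Signature and decode are hypothetical. It matches the standard BOM signatures (UTF-8: EF BB BF, UTF-16BE: FE FF, UTF-16LE: FF FE, UTF-32BE: 00 00 FE FF, UTF-32LE: FF FE 00 00), drops the mark, and decodes the rest with the charset the mark announces. It assumes UTF-8 when no BOM is found, and it assumes the JVM provides the UTF-32 charsets, which OpenJDK does although the Java specification does not require them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDecoder {
    /** Pairs a BOM byte signature with the charset it announces. */
    private static final class Signature {
        final Charset charset;
        final byte[] bytes;

        Signature(Charset charset, int... unsignedBytes) {
            this.charset = charset;
            this.bytes = new byte[unsignedBytes.length];
            for (int i = 0; i < unsignedBytes.length; i++) {
                this.bytes[i] = (byte) unsignedBytes[i];
            }
        }

        boolean matches(byte[] data) {
            if (data.length < bytes.length) {
                return false;
            }
            for (int i = 0; i < bytes.length; i++) {
                if (data[i] != bytes[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Order matters: the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE),
    // so the 4-byte signatures are tested before the 2-byte ones.
    private static final Signature[] SIGNATURES = {
        new Signature(Charset.forName("UTF-32BE"), 0x00, 0x00, 0xFE, 0xFF),
        new Signature(Charset.forName("UTF-32LE"), 0xFF, 0xFE, 0x00, 0x00),
        new Signature(StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF),
        new Signature(StandardCharsets.UTF_16BE, 0xFE, 0xFF),
        new Signature(StandardCharsets.UTF_16LE, 0xFF, 0xFE),
    };

    /** Decodes raw XML bytes to a String, dropping a leading BOM if one is recognized. */
    static String decode(byte[] data) {
        for (Signature signature : SIGNATURES) {
            if (signature.matches(data)) {
                int skip = signature.bytes.length;
                return new String(data, skip, data.length - skip, signature.charset);
            }
        }
        // Assumption: without a BOM the input is treated as UTF-8.
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, '<', 0, 'a', 0, '/', 0, '>', 0};
        System.out.println(decode(utf16le));
    }
}
```

Checking the longer signatures first is the main design choice here: a check for FF FE alone would misread a UTF-32LE document as UTF-16LE.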

Jan 08 '20 10:01 javadev

Some entry points about BOM characters:

https://stackoverflow.com/questions/1772321/what-is-xml-bom-and-how-do-i-detect-it

https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream/499033#499033

https://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#getEncoding%28%29

https://www.xponentsoftware.com/articles/Byte_order_mark.aspx

Could one of you attach an XML file containing BOM characters to this issue?

Jan 09 '20 20:01 smart-fun

@smart-fun I think the XML I've posted on https://github.com/smart-fun/XmlToJson/issues/20#issuecomment-542203816 contains BOM characters, as I was not able to parse it.

Jan 10 '20 07:01 lpbas

@L4grange I found that it is an ordinary XML without a BOM. http://prntscr.com/qlp122

Jan 10 '20 08:01 javadev

I found an example.

itemdescription_20140527014836.xml.zip

Jan 11 '20 03:01 javadev

It seems there is a generic solution using Apache BOMInputStream. I gave it a try, but with no success so far, and I have not had much time to investigate.

https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html

https://commons.apache.org/proper/commons-io/

I tested things like:

    BOMInputStream bim = new BOMInputStream(inputStream, false);

    if (bim.hasBOM()) {
        bim.skip(bim.getBOM().length());
    }
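
One possible explanation for the lack of success, based on the Commons IO Javadoc and not verified against this issue's input: passing false as the second constructor argument already excludes the BOM from the stream, so the extra skip() call would likely discard the first bytes of the XML itself. For comparison, here is a dependency-free sketch of the same idea using only java.io. It is a swapped-in technique, not BOMInputStream, and it handles the UTF-8 mark only. The names Utf8BomSkipper and skipUtf8Bom are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class Utf8BomSkipper {
    /**
     * Wraps the stream so that a leading UTF-8 BOM (EF BB BF) is consumed.
     * If the first bytes are not a BOM, they are pushed back and remain readable.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean isBom = read == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && read > 0) {
            // Not a BOM: give the bytes back so the caller sees the full content.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(data));
        int b;
        while ((b = clean.read()) != -1) {
            System.out.print((char) b);
        }
        System.out.println();
    }
}
```

The returned stream can be passed on wherever the original InputStream was used; no further skip() call is needed.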

Jan 11 '20 14:01 smart-fun
