XmlToJson
Can't parse XMLs containing BOM characters
Hello. Thank you for this library, I've been using it on some of my projects and I find it very useful!
Today I found out that when I try to parse XML strings fetched from an API that start with either \uFEFF or \uFFFF, the JSON returned is empty.
Is there a way to parse such XML strings without the obvious workaround of calling xmlString.replace(Regex("[\uFEFF-\uFFFF]"), "") and using the result as input to XmlToJson? I think this should be handled inside the library, because the problem is very difficult to debug: those characters are not visible in the Strings. I spent some hours trying to pinpoint why my input was not being parsed correctly. Thank you!
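A narrower variant of that workaround, as a minimal sketch: the regex above removes every character in the range U+FEFF to U+FFFF anywhere in the string (which also covers the fullwidth forms block), whereas a BOM only matters at position 0. The class and method names here are made up for illustration and are not part of XmlToJson.

```java
final class BomStrings {
    private static final char BOM = '\uFEFF';

    /** Returns the input without a single leading U+FEFF, or unchanged if there is none. */
    static String stripLeadingBom(String xml) {
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == BOM) {
            return xml.substring(1);
        }
        return xml;
    }

    public static void main(String[] args) {
        // Made-up sample payload with a BOM in front of the root element.
        String fetched = "\uFEFF<root><item>1</item></root>";
        String cleaned = stripLeadingBom(fetched);
        System.out.println(cleaned.startsWith("<root>")); // prints "true"
    }
}
```

The cleaned string would then be passed to the parser in place of the raw response.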
You may add a removeBom() method.
import java.util.Arrays;

// Strips a leading UTF-8 or UTF-16 byte order mark, if present.
private static byte[] removeBom(byte[] bytes) {
    // UTF-8 BOM: EF BB BF
    if ((bytes.length >= 3) && (bytes[0] == -17) && (bytes[1] == -69) && (bytes[2] == -65)) {
        return Arrays.copyOfRange(bytes, 3, bytes.length);
    }
    // UTF-16LE BOM: FF FE
    if ((bytes.length >= 2) && (bytes[0] == -1) && (bytes[1] == -2)) {
        return Arrays.copyOfRange(bytes, 2, bytes.length);
    }
    // UTF-16BE BOM: FE FF
    if ((bytes.length >= 2) && (bytes[0] == -2) && (bytes[1] == -1)) {
        return Arrays.copyOfRange(bytes, 2, bytes.length);
    }
    return bytes;
}
hello Lucas @L4grange and Valentyn @javadev !
thanks for using XmlToJson :)
the bom characters are a little obscure to me, and this seems to be a very specific problem. The proposed solution certainly works in this case, but is not very generic.
I was wondering if there could be a wider solution for special characters like bom, or if this should be fixed outside of the library (before using it).
what do you think?
Thank you for your reply @smart-fun ! I had never worked with BOM characters before facing this problem. However, since they are not visible and not easy to catch, and they do exist in some XMLs with deprecated formatting, I think it would be good for this library to handle XMLs with or without BOM characters.
As for a more generalised solution, I agree, but I haven't found any other whitespace characters that were breaking the parsing. If you have an example of other characters, we could look for such a solution.
I found the BOM bytes for XML encoded as UTF-16 and UTF-32.
Some entry points about BOM characters:
https://stackoverflow.com/questions/1772321/what-is-xml-bom-and-how-do-i-detect-it
https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream/499033#499033
https://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#getEncoding%28%29
https://www.xponentsoftware.com/articles/Byte_order_mark.aspx
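To make the UTF-16 and UTF-32 cases concrete, here is a rough sketch that detects the BOM on a byte array, drops it, and decodes with the matching charset. The class and method names are invented for this example. It assumes the JVM provides the UTF-32BE and UTF-32LE charsets (OpenJDK does, but they are not among the charsets every Java platform must support) and that input without a BOM is UTF-8. The 4-byte signatures are tested first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDecoder {
    // Longer signatures first, so UTF-32LE is not mistaken for UTF-16LE.
    private static final int[][] SIGNATURES = {
        {0x00, 0x00, 0xFE, 0xFF}, // UTF-32BE
        {0xFF, 0xFE, 0x00, 0x00}, // UTF-32LE
        {0xEF, 0xBB, 0xBF},       // UTF-8
        {0xFE, 0xFF},             // UTF-16BE
        {0xFF, 0xFE},             // UTF-16LE
    };
    private static final String[] CHARSET_NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    /** Decodes the bytes using the charset implied by a leading BOM and omits the BOM itself. */
    static String decode(byte[] bytes) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(bytes, SIGNATURES[i])) {
                int skip = SIGNATURES[i].length;
                // Charset.forName throws UnsupportedCharsetException if the JVM lacks the charset.
                return new String(bytes, skip, bytes.length - skip, Charset.forName(CHARSET_NAMES[i]));
            }
        }
        // Assumption for this sketch: input without a BOM is UTF-8.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    private static boolean startsWith(byte[] bytes, int[] signature) {
        if (bytes.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((bytes[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(decode(utf8WithBom)); // prints "<a/>"
    }
}
```

Checking the longest signature first is safe for XML because a document cannot start with a NUL character, so FF FE 00 00 at the start is not a plausible UTF-16LE document.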
Is it possible for one of you to attach to this issue an XML file containing BOM characters?
@smart-fun I think the XML I've posted on https://github.com/smart-fun/XmlToJson/issues/20#issuecomment-542203816 contains BOM characters, as I was not able to parse it.
@L4grange I found only ordinary XML without a BOM there. http://prntscr.com/qlp122
It seems there is a generic solution using Apache's BOMInputStream. I gave it a try, but have had no success so far and do not have much time to try further.
https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html
https://commons.apache.org/proper/commons-io/
tested things like:
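// With include = false, BOMInputStream already leaves a detected UTF-8 BOM out of
// the data it returns, so the extra skip below drops the first bytes of the XML itself.
BOMInputStream bim = new BOMInputStream(inputStream, false);
if (bim.hasBOM()) {
    bim.skip(bim.getBOM().length());
}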
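For comparison, the same idea can be sketched with the JDK alone, without the Commons IO dependency. This version only handles the UTF-8 BOM. The class and method names are made up for illustration, not taken from XmlToJson or Commons IO. It peeks at the first three bytes through a PushbackInputStream and pushes them back when they are not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class Utf8BomSkipper {
    /** Returns a stream positioned after a leading UTF-8 BOM, or at the start if there is none. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer than requested.
        int read = 0;
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) {
                break; // end of stream before three bytes were available
            }
            read += n;
        }

        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: put the bytes back so the caller sees the stream unchanged.
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) clean.read()); // prints "<"
    }
}
```

The returned stream can be handed to whatever consumes the XML. Nothing further needs to be skipped, because the BOM bytes have already been consumed.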