woodstox
Add option to allow broken encoding in attribute values
I have to consume a message from a message broker with (sometimes) broken encoding in one of its attributes. (It's from legacy software that nobody wants/dares to touch.)
Currently, when trying to parse the messages, I get the following exception:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #736, byte #53)
at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
...
If I decode the same bytes into a String directly (new String(bytes, "UTF-8")), it works perfectly fine.
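The difference between the two code paths is probably down to decoder strictness. The sketch below uses only the JDK: `new String(byte[], Charset)` is documented to replace malformed input with the replacement character U+FFFD, while a `CharsetDecoder` set to `CodingErrorAction.REPORT` throws, which is closer to what the parser's UTF-8 reader does. The class and method names (`DecodeDemo`, `lenient`, `strict`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Lenient: the String constructor silently substitutes U+FFFD for malformed bytes.
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: a decoder configured with REPORT throws on the first malformed byte.
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "<data attr=\"Success\" />".getBytes(StandardCharsets.UTF_8);
        bytes[13] = (byte) 0xfc; // same corruption as in the reproduction code

        System.out.println("lenient: " + lenient(bytes));
        try {
            System.out.println("strict: " + strict(bytes));
        } catch (CharacterCodingException e) {
            System.out.println("strict decoding failed: " + e);
        }
    }
}
```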
It would be nice to have an option that allows broken encodings in my Strings instead of throwing an exception. (After parsing the input, I usually have enough context to know which messages I have to fix and how.)
I use jackson-dataformat-xml 2.9.6 + woodstox 5.0.3/5.1 to parse the message.
Currently I use the following workaround to bypass the issue:
byte[] bytes = ...;
try {
    return xmlMapper.readValue(bytes, StateInfo.class);
} catch (JsonParseException e) {
    try {
        LOG.debug("Attempting fix");
        byte[] bytes2 = new String(bytes, CHARSET_ALT1).getBytes(UTF_8);
        return xmlMapper.readValue(bytes2, StateInfo.class);
    } catch (JsonParseException e1) {
        // Contains special characters from multiple encodings (in different attributes)
        LOG.error("Failed to repair message - Writing message to disk for manual fix");
        writeToDisk(e, bytes);
        throw e;
    }
}
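A variation on this workaround is to check the bytes before parsing, so the parser exception is no longer used as the detection mechanism. This is only a sketch under assumptions: `FallbackDecoder` and `decode` are hypothetical names, and ISO-8859-1 stands in for `CHARSET_ALT1`, whose actual value is not shown above. Like the workaround, it does not handle messages that mix several encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {

    // Try strict UTF-8 first; if the bytes are not valid UTF-8,
    // decode the whole message with the fallback charset instead.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            // A fresh decoder reports malformed input by throwing.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<data attr=\"Success\" />".getBytes(StandardCharsets.UTF_8);
        bytes[13] = (byte) 0xfc; // 0xfc is 'ü' in ISO-8859-1

        // ISO-8859-1 is an assumed stand-in for CHARSET_ALT1.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

The resulting String could then be passed to `xmlMapper.readValue(String, Class)` in place of the byte array, so the parser never sees the raw bytes.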
As an alternative I considered mapping the attribute to a plain byte[], but unfortunately the parser still reads the input as a String and then tries to Base64-decode it, and I didn't find a way to tell the parser to just give me the raw bytes without Base64-decoding them first.
Code to reproduce
Data class:
@JacksonXmlRootElement(localName = "data")
public static class Data {
    @JsonProperty("attr")
    public String attr;
    // public byte[] attr;

    @Override
    public String toString() {
        return "Data: " + attr;
    }
}
Test method:
public static void main(String[] args) throws IOException {
    XmlMapper xmlMapper = new XmlMapper();
    String input = "<data attr=\"Success\" />";
    byte[] bytes = input.getBytes("UTF-8");
    System.out.println(new String(bytes, "UTF-8"));
    System.out.println(xmlMapper.readValue(bytes, Data.class));
    bytes[13] = (byte) 0xfc; // u -> ü // Simulate broken encoding
    System.out.println(new String(bytes, "UTF-8"));
    System.out.println(xmlMapper.readValue(bytes, Data.class)); // Error
}
Output:
<data attr="Success" />
Data: Success
<data attr="S�ccess" />
Exception in thread "main" com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
at example.Test.main(Test.java:67)
Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
at com.ctc.wstx.io.UTF8Reader.reportInvalidInitial(UTF8Reader.java:304)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:190)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:89)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:995)
at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:754)
at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2074)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1175)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:653)
... 5 more
When constructing a String out of broken UTF-8 content, what happens? I am guessing the invalid byte gets decoded as the "question mark" replacement character (U+FFFD):
https://www.fileformat.info/info/unicode/char/0fffd/index.htm
which will then add garbage to the attribute value.
I don't think this is something Woodstox should really be doing. Although I understand it may be inconvenient, I think handling of broken content is something that the application needs to configure somehow.
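One way an application might take on that handling itself is to decode the bytes leniently and hand the parser a `Reader` instead of raw bytes, so the parser never performs UTF-8 decoding. The sketch below is an assumption-laden illustration, not a Woodstox feature: it uses the standard `javax.xml.stream` API with whatever StAX implementation is on the classpath, and `LenientXmlRead` and `readAttr` are hypothetical names. The invalid byte ends up as U+FFFD in the attribute value, which the application then has to repair.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LenientXmlRead {

    // Decode leniently, then return the named attribute of the first element.
    static String readAttr(byte[] bytes, String attrName) throws XMLStreamException {
        // Malformed bytes become U+FFFD here, before the parser is involved.
        String text = new String(bytes, StandardCharsets.UTF_8);

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(text));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // A null namespace URI means the namespace is not checked.
                    return reader.getAttributeValue(null, attrName);
                }
            }
            return null; // no element found
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] bytes = "<data attr=\"Success\" />".getBytes(StandardCharsets.UTF_8);
        bytes[13] = (byte) 0xfc; // simulate the broken encoding
        System.out.println(readAttr(bytes, "attr"));
    }
}
```

The trade-off is that the whole message is decoded up front and any encoding declared in the XML prolog is ignored, so this only fits inputs that are known to be nominally UTF-8.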