
Add option to allow broken encoding in attribute values

Open ST-DDT opened this issue 5 years ago • 1 comments

I have to consume a message from a message broker with (sometimes) broken encoding in one of its attributes. (It's from legacy software that nobody wants/dares to touch.)

Currently, when trying to parse the messages, I get the following exception:

com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #736, byte #53)
    at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
    at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
    ...

If I construct a String from the same bytes directly, it works perfectly fine.

It would be nice to have an option that allows broken encodings to end up in my Strings instead of throwing exceptions. (After parsing the input, I usually have enough context to know which messages I have to fix and how.)

I use jackson-dataformat-xml 2.9.6 + woodstox 5.0.3/5.1 to parse the message.

Currently I use the following workaround to bypass the issue:

byte[] bytes = ...; 
try {
	return xmlMapper.readValue(bytes, StateInfo.class);
} catch (JsonParseException e) {
	try {
		LOG.debug("Attempting fix");
		byte[] bytes2 = new String(bytes, CHARSET_ALT1).getBytes(UTF_8);
		return xmlMapper.readValue(bytes2, StateInfo.class);
	} catch (JsonParseException e1) {
		// Contains special characters from multiple encodings (in different attributes)
		LOG.error("Failed to repair message - Writing message to disk for manual fix");
		writeToDisk(e, bytes);
		throw e;
	}
}
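
A variation on the workaround above is to check the bytes with a strict UTF-8 decoder before parsing, and transcode only when that check fails, so the XML parser never sees malformed input. This is only a sketch: `Utf8Repair`, `isValidUtf8` and `toUtf8` are hypothetical names, and ISO-8859-1 is an assumed stand-in for `CHARSET_ALT1`, whose actual value is not stated here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Repair {

    // Assumption: stands in for CHARSET_ALT1, which the report does not specify.
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the input unchanged if it is valid UTF-8, else transcodes it from FALLBACK. */
    static byte[] toUtf8(byte[] bytes) {
        if (isValidUtf8(bytes)) {
            return bytes;
        }
        return new String(bytes, FALLBACK).getBytes(StandardCharsets.UTF_8);
    }
}
```

The result of `toUtf8(bytes)` could then be passed to `xmlMapper.readValue(...)` as before. Like the original workaround, this does not help with messages that mix several encodings in different attributes.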

As an alternative, I considered binding the attribute as raw bytes (byte[]), but unfortunately the parser still reads the input as a String so that it can Base64-decode it, and I didn't find a way to tell the parser to just give me the bytes without Base64-decoding them first.

Code to reproduce

Data class:

@JacksonXmlRootElement(localName = "data")
public static class Data {

	@JsonProperty("attr")
	public String attr;
	// public byte[] attr;

	@Override
	public String toString() {
		return "Data: "+ attr;
	}

}

Test method:

public static void main(String[] args) throws IOException {
	XmlMapper xmlMapper = new XmlMapper();
	String input = "<data attr=\"Success\" />";
	byte[] bytes = input.getBytes("UTF-8");

	System.out.println(new String(bytes, "UTF-8"));
	System.out.println(xmlMapper.readValue(bytes, Data.class));

	bytes[13] = (byte) 0xfc; // u -> ü // Simulate broken encoding

	System.out.println(new String(bytes, "UTF-8"));
	System.out.println(xmlMapper.readValue(bytes, Data.class)); // Error
}

Output:

<data attr="Success" />
Data: Success
<data attr="S�ccess" />
Exception in thread "main" com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
	at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
	at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
	at example.Test.main(Test.java:67)
Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
	at com.ctc.wstx.io.UTF8Reader.reportInvalidInitial(UTF8Reader.java:304)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:190)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:89)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:995)
	at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:754)
	at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2074)
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1175)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:653)
	... 5 more

ST-DDT, Aug 09 '18 08:08

When constructing a String out of broken UTF-8 content, what happens? I am guessing the invalid byte gets decoded as the replacement character (the "question mark" glyph):

https://www.fileformat.info/info/unicode/char/0fffd/index.htm

which would then add garbage to the attribute value.
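
For reference, a small sketch of what the JDK does with the bytes from the reproduction. `ReplacementCharDemo` and `decodeLenient` are illustrative names only. The `String(byte[], Charset)` constructor replaces malformed input with U+FFFD, so the original 0xFC byte cannot be recovered from the resulting String.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {

    /** Decodes leniently: malformed UTF-8 input is replaced with U+FFFD, no exception. */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "<data attr=\"Success\" />".getBytes(StandardCharsets.UTF_8);
        bytes[13] = (byte) 0xfc; // same corruption as in the reproduction above

        String decoded = decodeLenient(bytes);
        // The char at index 13 is now the replacement character.
        System.out.println(String.format("U+%04X", (int) decoded.charAt(13)));

        // Re-encoding yields the three-byte UTF-8 form of U+FFFD, not the original byte,
        // so the output is two bytes longer than the input.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " -> " + reencoded.length);
    }
}
```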

I don't think this is something Woodstox should really be doing. Although I understand it may be inconvenient, I think handling of broken content is something that the application needs to handle or configure itself.
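
One way an application might do this on its own side is to decode the bytes leniently and hand the parser characters instead of bytes, so the parser never performs UTF-8 decoding at all. The sketch below uses the JDK's built-in StAX API rather than Woodstox or Jackson to stay self-contained. `LenientXmlRead` and `readAttr` are hypothetical names, and the malformed byte ends up as U+FFFD in the attribute value.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class LenientXmlRead {

    /** Returns the named attribute of the first element, or null if it is absent. */
    static String readAttr(byte[] bytes, String attrName) {
        // Lenient decode: malformed UTF-8 becomes U+FFFD, which is a legal XML character.
        String text = new String(bytes, StandardCharsets.UTF_8);
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(text));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // A null namespace URI means the namespace is not checked.
                        return reader.getAttributeValue(null, attrName);
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

With Jackson, the same idea would presumably amount to passing the decoded String (or a Reader) to `xmlMapper.readValue(...)` instead of the byte array. The application can then look for U+FFFD in the bound values to decide which messages need repair.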

cowtowncoder, Aug 21 '18 23:08