spring-batch
spring-batch copied to clipboard
Platform default encoding used unconditionally in StaxEventItemReader
Bug description StaxEventItemReader uses platform default encoding to read XML files.
Environment Spring-batch version 4.3.0. Related to #807
Steps to reproduce Just try to parse an XML file with a declared encoding different to the platform encoding.
Expected behavior Encoding field in StaxEventItemReader should default to null, so that the parser has the opportunity to autodetect the file encoding.
Hi @benas. I've checked that just assigning null on the encoding field in StaxEventItemReader allows the eventReader to autodetect the XML file encoding.
Thank you for checking. However, while reviewing this, I don't see anything wrong with using the platform's default encoding as a default encoding. I believe that is a better default than null.
The title of this issue is implying that the default encoding is used unconditionally, but that is not accurate. It is just a default value, made configurable in #807. If the default encoding is not suitable, it can be changed with the setter added in #807.
Closing this as a non issue.
Hi, @fmbenhassine: XML files are a bit different to flat files, as they include a header that declares the encoding of the data that follows.
In my experience it's quite common to pass null as the encoding to the XML parser so that it can decide about it reading the header. Sometimes, it's useful to force the parser to use an encoding when the file to parse doesn't contain the encoding header or when this header is not correct, but this situation is uncommon.
Nevertheless, the configurable encoding introduced in #807, would be enough if the setter would't reject the null value, as it does currently:
/**
* Set encoding to be used for the input file. Defaults to {@link #DEFAULT_ENCODING}.
*
* @param encoding the encoding to be used
*/
public void setEncoding(String encoding) {
Assert.notNull(encoding, "The encoding must not be null");
this.encoding = encoding;
}
That's a different story. Accepting null and using null as default are different design choices. I am re-opening this issue as we can make the reader accept null, but we need to make sure to address this correctly when configuring the event reader.
In my experience it's quite common to pass null as the encoding to the XML parser so that it can decide about it reading the header
Some StAX implementations might not accept nulls. Spring Batch uses Woodstox, which accepts null encoding, but since the StaxEventItemReader uses the implementation-agnostic XMLInputFactory.createXMLEventReader which does not document how to handle null encodings, I prefer not passing null.
The goal is to cover the option of letting the reader detect the encoding without passing null. I see there are two overloaded XMLInputFactory.createXMLEventReader methods to create the reader:
/**
* Create a new XMLEventReader from a java.io.InputStream
* @param stream the InputStream to read from
* @return an instance of the {@code XMLEventReader}
* @throws XMLStreamException if an error occurs
*/
public abstract XMLEventReader createXMLEventReader(java.io.InputStream stream)
throws XMLStreamException;
/**
* Create a new XMLEventReader from a java.io.InputStream
* @param stream the InputStream to read from
* @param encoding the character encoding of the stream
* @return an instance of the {@code XMLEventReader}
* @throws XMLStreamException if an error occurs
*/
public abstract XMLEventReader createXMLEventReader(java.io.InputStream stream, String encoding)
throws XMLStreamException;
If we make the setter accept null, do you think that xmlInputFactory.createXMLEventReader(inputStream, null) and xmlInputFactory.createXMLEventReader(inputStream) have the same effect WRT detecting the encoding? I will give it a try, but if you can test that and share your feedback, I would be grateful. Using xmlInputFactory.createXMLEventReader(inputStream) is safer and could be added to Spring Batch. The idea is to update the StaxEventItemReader with something like this:
eventReader = this.encoding != null ? xmlInputFactory.createXMLEventReader(inputStream, this.encoding) :
xmlInputFactory.createXMLEventReader(inputStream);
That's a different story. Accepting null and using null as default are different design choices. I am re-opening this issue as we can make the reader accept null, but we need to make sure to address this correctly when configuring the event reader.
That's a point!
If we make the setter accept null, do you think that xmlInputFactory.createXMLEventReader(inputStream, null) and xmlInputFactory.createXMLEventReader(inputStream) have the same effect WRT detecting the encoding?
Good question. My intuition suggests that both methods should have the same effect, but intuition and XML APIs do not always match.
I will give it a try, but if you can test that and share your feedback, I would be grateful. Using xmlInputFactory.createXMLEventReader(inputStream) is safer and could be added to Spring Batch.
Tried both methods in two different 1.8 JREs (IBM and SUN). All of them work the same way no matter you use xmlInputFactory.createXMLEventReader(inputStream, null) or xmlInputFactory.createXMLEventReader(inputStream).
I've drilled down a bit inside both code flows and all of them converge into xmlInputFactory.getXMLStreamReaderImpl(InputSource) receiving an implementation of org.xml.sax.InputSource with a null encoding. Seems that InputSource has a documented behaviour regarding encoding which matches with intuition:
* <p>The SAX parser will use the InputSource object to determine how
* to read XML input. If there is a character stream available, the
* parser will read that stream directly, disregarding any text
* encoding declaration found in that stream.
* If there is no character stream, but there is
* a byte stream, the parser will use that byte stream, using the
* encoding specified in the InputSource or else (if no encoding is
* specified) autodetecting the character encoding using an algorithm
* such as the one in the XML specification. If neither a character
* stream nor a
* byte stream is available, the parser will attempt to open a URI
* connection to the resource identified by the system
* identifier.</p>
Thank you for your feedback, Appreciate it! I think using xmlInputFactory.createXMLEventReader(inputStream) is safe when the encoding is not specified, and since Spring Batch uses the implementation agnostic API, we can always defer to the implementation for specific details (ie correctly or incorrectly handling encodings).
I will plan this enhancement for the upcoming 5.0.2, by accepting null in the setter and calling xmlInputFactory.createXMLEventReader(inputStream) when the encoding is not specified.
Backport-to-4.3.x, also appreciated, thanks!