spring-batch
TransactionAwareBufferedWriter adds byte order mark on each chunk [BATCH-1985]
Jimmy Praet opened BATCH-1985 and commented
When using a TransactionAwareBufferedWriter (via FlatFileItemWriter or StaxEventItemWriter) with an encoding that requires a byte order mark (e.g. UTF-16), the byte order mark (BOM) is emitted on each chunk. For each chunk, string.getBytes(encoding) is called on the buffered string, and the returned byte array starts with a BOM.
The BOM should only be written at the very beginning of the output stream. If a BOM appears anywhere else, it is interpreted as a 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF).
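To illustrate the behaviour described above outside of Spring Batch, here is a minimal sketch in plain Java. It does not use TransactionAwareBufferedWriter; the class and method names (BomPerChunkDemo, encodePerChunk, encodeWithSingleWriter, countBoms) are hypothetical and exist only for this example. It compares encoding every chunk separately with getBytes against keeping a single OutputStreamWriter open across chunks, which is one possible way to get a single leading BOM and is not necessarily how the framework addresses the problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomPerChunkDemo {

    // Encodes every chunk on its own, mimicking a getBytes(encoding) call per chunk.
    static byte[] encodePerChunk(String[] chunks, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String chunk : chunks) {
            byte[] bytes = chunk.getBytes(charset);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    // Keeps one encoder alive for the whole stream, flushing after each chunk.
    static byte[] encodeWithSingleWriter(String[] chunks, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            for (String chunk : chunks) {
                writer.write(chunk);
                writer.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Counts big-endian UTF-16 BOMs (0xFE 0xFF) at 2-byte-aligned offsets.
    static int countBoms(byte[] data) {
        int count = 0;
        for (int i = 0; i + 1 < data.length; i += 2) {
            if ((data[i] & 0xFF) == 0xFE && (data[i + 1] & 0xFF) == 0xFF) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] chunks = {"ab", "cd", "ef"};
        Charset utf16 = StandardCharsets.UTF_16;
        System.out.println(countBoms(encodePerChunk(chunks, utf16)));         // prints 3
        System.out.println(countBoms(encodeWithSingleWriter(chunks, utf16))); // prints 1
    }
}
```

The per-chunk variant produces one BOM per chunk, while the single-writer variant produces only the leading one, because the encoder state survives between chunks.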
- http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#standard
- https://forums.oracle.com/forums/thread.jspa?threadID=2042544
- http://www.unicode.org/faq/utf_bom.html
No further details from BATCH-1985
Michael Minella commented
Small point of clarification: the extra bytes are not emitted on each chunk. There seem to be additional bytes added when restarting (appending to a file).
Jimmy Praet commented
I have a test case here: https://github.com/jpraet/spring-batch/commit/a689e25fb27b3f530cc35ff850fd2d01e22bd2ae. I'm seeing the BOM being emitted on each chunk, which makes sense because the buffer is cleared on each chunk.
I have found the following encodings affected by this bug:
- UTF-16
- x-UTF-32BE-BOM
- x-UTF-32LE-BOM
- UnicodeBig
- UnicodeLittle
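A list like the one above can be checked on a given JVM with a small probe. The following is a rough sketch, not the test case linked earlier; BomProbe and bomLength are hypothetical names. It infers the BOM length by comparing the encoded sizes of a one-character and a two-character string, and it assumes a fixed-width encoding of the letter 'a', which holds for the UTF-16 and UTF-32 families. Which charset names are available depends on the JDK, so unsupported names are reported rather than treated as errors.

```java
import java.nio.charset.Charset;

class BomProbe {

    // Returns how many bytes getBytes() emits in front of the payload, or -1 if
    // the charset is not available on this JVM.
    static int bomLength(String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        Charset charset = Charset.forName(charsetName);
        int one = "a".getBytes(charset).length;
        int two = "aa".getBytes(charset).length;
        int bytesPerChar = two - one;   // size of one encoded 'a'
        return one - bytesPerChar;      // whatever is left over is the prefix
    }

    public static void main(String[] args) {
        String[] names = {
            "UTF-16", "x-UTF-32BE-BOM", "x-UTF-32LE-BOM",
            "UnicodeBig", "UnicodeLittle", "UTF-16BE", "UTF-8"
        };
        for (String name : names) {
            int length = bomLength(name);
            System.out.println(name + ": "
                    + (length < 0 ? "not supported on this JVM" : length + " BOM byte(s) per getBytes call"));
        }
    }
}
```

Any charset that reports a non-zero prefix here would repeat that prefix whenever chunks are encoded independently.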
Thank you for opening the issue. Can you retry with the latest release of Spring Batch (5.0.2) and report back the results?