wikiforia
Invalid white space character exception
Hi,
I'm using Wikiforia (version 1.1.1) to parse the English Wikipedia dump (dated 20150602) and I encounter the error below, which makes Wikiforia stop. Below is the log. How can I fix this?
java.io.IOError: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
    at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:91) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.pipeline.IdentityFilter.process(IdentityFilter.java:11) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.wikipedia.parser.SwebleWikimarkupParserBase.process(SwebleWikimarkupParserBase.java:91) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser.access$300(MultistreamBzip2XmlDumpParser.java:43) ~[wikiforia-1.1.1.jar:?]
    at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:349) ~[wikiforia-1.1.1.jar:?]
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
    at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:462) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
    at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:85) ~[wikiforia-1.1.1.jar:?]
    ... 7 more
Caused by: java.io.IOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
    at com.ctc.wstx.api.InvalidCharHandler$FailingHandler.convertInvalidChar(InvalidCharHandler.java:55) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
    at com.ctc.wstx.sw.XmlWriter.handleInvalidChar(XmlWriter.java:623) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
    at com.ctc.wstx.sw.BufferingXmlWriter.writeCharacters(BufferingXmlWriter.java:554) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
    at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:460) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
    at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:85) ~[wikiforia-1.1.1.jar:?]
    ... 7 more

java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
    at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
    at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.getNext(MultistreamBzip2XmlDumpParser.java:286)
    at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.read(MultistreamBzip2XmlDumpParser.java:305)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:232)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.complete(BZip2CompressorInputStream.java:348)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:284)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:868)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:917)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:217)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:172)
    at com.ctc.wstx.io.BaseReader.readBytes(BaseReader.java:155)
    at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:368)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:111)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:87)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:991)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4147)
    at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:679)
    at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.processPages(XmlDumpParser.java:276)
    at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.next(XmlDumpParser.java:337)
    at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:345)

[the same InterruptedException stack trace is printed twice more by the other worker threads]
The backtrace indicates a parsing problem in the decompressed XML content. I'm using Woodstox for the actual XML parsing and Apache Commons Compress for the BZip2 decompression. I have encountered parsing errors like this in the past, but the cause was a corrupted dump.
I need the following to verify that it's actually a bug and not a problem with the dump:
- Check that the local dump hash matches the server hash (verifies that it was downloaded correctly; see the sketch after this list)
- Do an integrity check of the compressed archive with bzip2, e.g. "bunzip2 -tv FILE.xml.bz2" (verifies that it decompresses cleanly; the hash can match while the decompressed content is still corrupt, which I have encountered)
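For the first check, here is a minimal Java sketch (the class name is just for illustration, and it assumes the checksums published for these dumps are MD5, which matches the hashes quoted later in this thread; any standalone md5 tool does the same job):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

/** Prints the MD5 of a dump file for comparison with the published checksum. */
public class Md5Check {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] buf = new byte[1 << 16];
            int n;
            // Stream the file through the digest so even a 12 GB dump fits in memory
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        // Compare this output against the checksum file on dumps.wikimedia.org
        System.out.println(hex);
    }
}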
Hi Marcus,
I've run both steps above on the Wikipedia dump I downloaded (English, 20150602) and it's all good. I think it's a bug.
The problem is that an invalid character has entered the XML stream, which implies that the input dump is not a valid XML document/stream. The Woodstox parser I'm using is a conformant XML parser, which to my knowledge means these checks unfortunately cannot simply be turned off; the invalid characters have to be filtered out before XML parsing.
A few potential solutions are:
- Implement a filtering stream that removes the invalid characters before they reach the XML parser (see the sketch after this list).
- Replace the XML parser with something else, potentially a custom parser.
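For the first option, a rough sketch of what such a filtering stream could look like. This is an illustration, not Wikiforia code; it works at the UTF-16 char level, drops the C0 control characters that XML 1.0 forbids (such as the 0x14 above), and passes surrogate halves through untouched so that valid supplementary characters survive:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** A Reader that silently drops characters which are not valid in XML 1.0. */
public class XmlCharFilterReader extends FilterReader {

    public XmlCharFilterReader(Reader in) {
        super(in);
    }

    // Valid XML 1.0 chars at the UTF-16 code-unit level: tab, LF, CR,
    // and 0x20..0xFFFD. Surrogate halves (0xD800..0xDFFF) fall inside
    // this range and are passed through untouched.
    private static boolean isValidXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isValidXmlChar((char) c));
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        while (true) {
            int n = super.read(cbuf, off, len);
            if (n <= 0) {
                return n; // EOF (-1) or a zero-length read
            }
            // Compact the buffer in place, keeping only valid characters
            int out = off;
            for (int i = off; i < off + n; i++) {
                if (isValidXmlChar(cbuf[i])) {
                    cbuf[out++] = cbuf[i];
                }
            }
            if (out > off) {
                return out - off;
            }
            // Every char in this chunk was invalid; read the next chunk
        }
    }
}

Wrapping the Reader that feeds the XML parser (or the text handed to the XML writer) in something like this would silently drop the offending characters.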
I will investigate the problem further.
I was unable to reproduce the exception you are getting. I downloaded the enwiki version you used (20150602), and ran a full XML parse through the entire dump via a new test decompression feature and it passed without exception.
My testing environment is OS X 10.10.4 with Java version:
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
The two files I downloaded, with their hashes after " <=> ":
enwiki-20150602-pages-articles-multistream-index.txt.bz2 <=> 100335c540528f8ff9f8cbd5164e7ca2
enwiki-20150602-pages-articles-multistream.xml.bz2 <=> 31e400b3893a4a970782c42aa5753392
Both have been verified against the Wikipedia dump server, and the larger one contains exactly 12 796 448 739 bytes.
I have made a new release; since it is not identical to the version you used, it might fix the problem.
I also implemented a new test decompression function in Wikiforia which only performs decompression and XML parsing, with no MediaWiki parsing (to save time). I have also added a feature that extracts the two blocks of compressed XML from the dump where it failed, so that manual inspection is possible.
Please try the new version; if it fails, verify that we have the same dump, or check the compressed parts output by the program. They are named stream-current-[start byte position]-[length].xml.bz2 and stream-prev-[start byte position]-[length].xml.bz2, and they are created in the working directory.
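For the manual inspection, something along these lines can scan an extracted block for the offending bytes. A sketch with a hypothetical class name; it assumes Commons Compress is on the classpath, which Wikiforia already depends on:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

/** Scans a stream-*.xml.bz2 block and reports invalid XML control characters. */
public class FindInvalidChars {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BZip2CompressorInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            long pos = 0;
            int b;
            while ((b = in.read()) != -1) {
                // C0 control bytes other than tab (0x09), LF (0x0A) and CR (0x0D)
                // are illegal in XML 1.0; they never occur inside UTF-8 multi-byte
                // sequences, so a plain byte scan is enough to locate them.
                if (b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D) {
                    System.out.printf("invalid char 0x%02X at decompressed offset %d%n", b, pos);
                }
                pos++;
            }
        }
    }
}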
Hi Marcus,
Thank you very much. I'll give you the result when I get back to the office this morning.
I've been running into the same problem (on the dumps of 2015-08-05 and 2015-09-01). Today I found a workaround: only include namespace 0 (-fns 0). This still leaves me with 13 GB of (uncompressed) data, which is about twice as much as I get when I run into the error (usually 6.9 GB).