java.lang.IndexOutOfBoundsException in WikiTextParser

Open cheetah90 opened this issue 8 years ago • 1 comments

I am parsing the Spanish Wikipedia XML dumps using WikiTextParser and getting the following error. By the end of the run, there are 4000+ IndexOutOfBoundsException errors.

Another odd thing about parsing the Spanish Wikipedia, which may or may not be related, is that nothing in Namespace.CATEGORY is parsed and the category_members table ends up very small. Is it possible that the category pages triggered the IndexOutOfBoundsException errors?

22:14:08.314 [pool-4-thread-8] INFO  org.wikibrain.parser.wiki.LocalLinkVisitor - Visited link #4000000
22:14:09.392 [pool-4-thread-6] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 420000
22:14:10.670 [pool-4-thread-7] WARN  org.wikibrain.parser.wiki.WikiTextDumpParser - exception while parsing unknown
java.lang.IndexOutOfBoundsException: Index: 2938, Size: 2938
    at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_66]
    at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_66]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.SpanManager.getSrcPos(SpanManager.java:63) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.buildNestedList(ModularParser.java:1234) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parseSections(ModularParser.java:592) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at de.tudarmstadt.ukp.wikipedia.parser.mediawiki.ModularParser.parse(ModularParser.java:401) ~[de.tudarmstadt.ukp.wikipedia.parser-0.9.2.jar:?]
    at org.wikibrain.parser.wiki.WikiTextParser.parse(WikiTextParser.java:64) ~[classes/:?]
    at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:97) [classes/:?]
    at org.wikibrain.parser.wiki.WikiTextDumpParser$ParserProcedure.call(WikiTextDumpParser.java:76) [classes/:?]
    at org.wikibrain.utils.ParallelForEach$4.run(ParallelForEach.java:177) [classes/:?]
    at org.wikibrain.utils.ParallelForEach$BoundedExecutor$1.run(ParallelForEach.java:257) [classes/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]
22:14:14.156 [pool-4-thread-8] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 430000
22:14:20.485 [pool-4-thread-6] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 440000
22:14:29.073 [pool-4-thread-8] INFO  org.wikibrain.utils.ParallelForEach - processing iterable 450000

cheetah90, Dec 08 '15 04:12

These exceptions occur regularly; they come from the underlying parsing library (de.tudarmstadt) and appear to be mostly benign.
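
For readers wondering why such exceptions can be benign: in the log above, the failure is reported at WARN level and processing continues afterwards, which suggests that each page is parsed inside its own try/catch. The following is a minimal sketch of that pattern, not WikiBrain's actual code. The names `TolerantParseSketch`, `toyParse`, and `parseAll` are hypothetical, and the toy parser only imitates the library failure by reading past the end of an empty list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class TolerantParseSketch {

    /** Counts of pages that parsed cleanly and pages that threw. */
    static final class Result {
        final int parsed;
        final int failed;

        Result(int parsed, int failed) {
            this.parsed = parsed;
            this.failed = failed;
        }
    }

    /**
     * Hypothetical stand-in for the third-party parser (an assumption, not the
     * real library): input starting with "*" triggers an out-of-range list
     * access, similar in kind to the exception in the stack trace.
     */
    static String toyParse(String wikiText) {
        List<Integer> spans = new ArrayList<>();
        if (wikiText.startsWith("*")) {
            spans.get(0); // throws IndexOutOfBoundsException: the list is empty
        }
        return wikiText.trim();
    }

    /**
     * Parses every page, isolating failures so that one bad page does not
     * abort the whole run. Failed pages are counted and skipped.
     */
    static Result parseAll(List<String> pages, Function<String, String> parser) {
        int parsed = 0;
        int failed = 0;
        for (String page : pages) {
            try {
                parser.apply(page);
                parsed++;
            } catch (RuntimeException e) {
                // A real pipeline would log a warning here and move on.
                failed++;
            }
        }
        return new Result(parsed, failed);
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("plain text", "* nested list", "more text");
        Result r = parseAll(pages, TolerantParseSketch::toyParse);
        System.out.println("parsed=" + r.parsed + " failed=" + r.failed);
        // prints "parsed=2 failed=1"
    }
}
```

Under this pattern a failing page is skipped rather than stopping the run, so the cost of each exception is the content of that one page. Comparing the failure count against the total number of pages processed gives a rough sense of how much is lost.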

shilad, Dec 09 '15 03:12