Wikidata-Toolkit

XMLStreamException exception when parsing revision dumps

Open Tpt opened this issue 8 years ago • 3 comments

When parsing some big revision dumps this error is raised:

sept. 23, 2016 7:05:46 PM org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor processDumpFileContents
SEVERE: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[291117,14066]
Message: JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING".

This is probably because the XML file exceeds a limit imposed by XML secure processing. We should probably set the XMLConstants.FEATURE_SECURE_PROCESSING option to false, since Wikimedia dumps should be trustworthy.

Tpt avatar Sep 24 '16 07:09 Tpt

Maybe our solution works here as well: https://github.com/dbpedia/extraction-framework/issues/487#issuecomment-255297245

jimkont avatar Oct 21 '16 05:10 jimkont

@jimkont Thanks for pointing this out.

@Tpt If you have time to try this, it would be very much appreciated.

mkroetzsch avatar Oct 21 '16 10:10 mkroetzsch

It works great with the parameter -Djdk.xml.entityExpansionLimit=2147480000 when running JARs that use WDTK.

We should maybe find a way to integrate it into WDTK so that we don't have to set this parameter every time.

Tpt avatar Nov 06 '16 09:11 Tpt
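
One possible way to avoid passing the flag on every run is to set the same system property from code before the XML parser is created. The following is only a minimal sketch under that assumption, not WDTK's actual code: the class and method names (EntityLimitSketch, raiseEntityLimitIfUnset, countStartElements) are hypothetical, and it relies on the JDK reading the property when the parser factory is instantiated. The error message above refers to the accumulated size of entities, for which the JDK also documents a separate jdk.xml.totalEntitySizeLimit property; this thread only reports testing jdk.xml.entityExpansionLimit, so that is the one used here.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of WDTK.
public class EntityLimitSketch {
    // Property name and value are the ones reported to work in this thread.
    static final String LIMIT_PROPERTY = "jdk.xml.entityExpansionLimit";
    static final String LIMIT_VALUE = "2147480000";

    // Sets the limit only if the user has not already chosen one,
    // so an explicit -D flag on the command line still wins.
    static void raiseEntityLimitIfUnset() {
        if (System.getProperty(LIMIT_PROPERTY) == null) {
            System.setProperty(LIMIT_PROPERTY, LIMIT_VALUE);
        }
    }

    // Stands in for the dump processing loop: streams through the XML
    // and counts start elements. The property is set before the factory
    // is created (assumption: the JDK reads it at that point).
    static int countStartElements(String xml) {
        raiseEntityLimitIfUnset();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked exception.
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><revision/></page></mediawiki>";
        System.out.println(countStartElements(xml));
    }
}
```

Setting the property only when it is absent keeps the command-line flag usable as an override. Whether this is enough for a full revision dump would still need to be checked against a real file.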