Wikidata-Toolkit
XMLStreamException exception when parsing revision dumps
When parsing some big revision dumps this error is raised:
sept. 23, 2016 7:05:46 PM org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor processDumpFileContents
SEVERE: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[291117,14066]
Message: JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING".
This is probably because the XML file contains too many nodes with respect to the limit set by XML secure processing. We should probably set the XMLConstants.FEATURE_SECURE_PROCESSING option to false, as Wikimedia dumps should be trustworthy.
Maybe our solution works here as well: https://github.com/dbpedia/extraction-framework/issues/487#issuecomment-255297245
@jimkont Thanks for pointing this out.
@Tpt If you have time to try this, it would be very much appreciated.
It works great with the parameter -Djdk.xml.entityExpansionLimit=2147480000 when running JARs that use WDTK.
We should maybe find a way to integrate it into WDTK so that we don't have to set this parameter every time.
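One possible way to do this, as a rough sketch rather than actual WDTK code: set the system property programmatically before the XMLInputFactory is created. The class RelaxedXmlLimits and its methods are hypothetical names used only for illustration. The sketch assumes that the JDK's built-in StAX implementation reads the property when the factory is created, and it leaves any value already passed with -D untouched. Other JAXP limits such as jdk.xml.totalEntitySizeLimit could be set the same way if needed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of WDTK.
public class RelaxedXmlLimits {

    // Property name and value taken from the command-line workaround above.
    static final String LIMIT_PROPERTY = "jdk.xml.entityExpansionLimit";
    static final String LIMIT_VALUE = "2147480000";

    // Creates a factory after raising the limit, unless the user
    // already chose a value with -D on the command line.
    static XMLInputFactory newFactory() {
        if (System.getProperty(LIMIT_PROPERTY) == null) {
            System.setProperty(LIMIT_PROPERTY, LIMIT_VALUE);
        }
        return XMLInputFactory.newInstance();
    }

    // Small demonstration: counts the start elements in an XML string.
    static int countStartElements(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                newFactory().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<mediawiki><page><revision/></page></mediawiki>";
        System.out.println(countStartElements(xml));
    }
}
```

Because the property is global to the JVM, this also affects every other XML parser created afterwards in the same process, which may or may not be acceptable for applications embedding WDTK.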