Wikidata-Toolkit
IndexOutOfBoundsException via org.wikidata.wdtk.storage.datastructures.BitVectorImpl.assertRange(BitVectorImpl.java:169)
Dear Devs,
I'm making use of the wikidata-toolkit in a little project at https://bitbucket.org/mjw99/wikidatachemscraper/overview . Essentially, I'm trying to harvest all chemical structure diagrams that are in the SVG format and have a Standard InChI Key associated with them.
If one follows the example on the splash screen of the project, I am seeing the following exception:
2015-07-28 17:46:02 INFO - [statistics] Namespaces: {0=, 1=Talk, 2=User, 3=User talk, 4=Wikidata, 5=Wikidata talk, 6=File, 7=File talk, 8=MediaWiki, 829=Module talk, 9=MediaWiki talk, 828=Module, 10=Template, 11=Template talk, 12=Help, 13=Help talk, 14=Category, 15=Category talk, 2600=Topic, -2=Media, -1=Special, 1199=Translations talk, 1198=Translations, 123=Query talk, 122=Query, 121=Property talk, 120=Property}
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Position 207239102 is out of bounds.
at org.wikidata.wdtk.storage.datastructures.BitVectorImpl.assertRange(BitVectorImpl.java:169)
at org.wikidata.wdtk.storage.datastructures.BitVectorImpl.getBit(BitVectorImpl.java:241)
at org.wikidata.wdtk.dumpfiles.MwRevisionProcessorBroker.processRevision(MwRevisionProcessorBroker.java:138)
at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlRevision(MwRevisionDumpFileProcessor.java:427)
at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlPage(MwRevisionDumpFileProcessor.java:345)
at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.tryProcessXmlPage(MwRevisionDumpFileProcessor.java:282)
at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlMediawiki(MwRevisionDumpFileProcessor.java:202)
at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processDumpFileContents(MwRevisionDumpFileProcessor.java:155)
at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processDumpFile(DumpProcessingController.java:559)
at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processDump(DumpProcessingController.java:495)
at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processMostRecentMainDump(DumpProcessingController.java:444)
at name.mjw.wikidatachemscraper.StdInChIKeyExample.main(StdInChIKeyExample.java:30)
... 6 more
This worked with an older dump (early 2014), but this dump is no longer available.
Any ideas?
Thanks,
Mark
This is caused by our bitvector implementation not growing automatically in that release. Because the number of revisions has grown beyond the (hardcoded) bounds, you are seeing an out-of-bounds error.
If you only need the current content of Wikidata, then you should switch to using JSON dumps if at all possible. They are smaller files, are parsed more reliably, and are also faster to process. The XML dump is only needed if you want revision data (user id, revision time, etc.), or if you need a historic record of all past revisions (this is very large and will grow further, thus taking a long time to parse).
Nevertheless, we should also fix the problem by having the bitvector grow automatically. In fact, it might be that this is already implemented and just not part of the release you are using. Alternatively, it might be that our implementation supports bitvector growth in principle but fails to trigger it in your case. We will need to investigate this.
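To illustrate what "growing automatically" could look like, here is a minimal sketch of a bit vector that enlarges its backing array when a bit beyond the current end is set. This is not the Wikidata Toolkit implementation: the class name GrowableBitVector, its methods, and the choice to report positions beyond the current end as unset (rather than throwing) are all assumptions made for illustration.

```java
import java.util.Arrays;

/**
 * Hypothetical growable bit vector (illustration only, not the
 * org.wikidata.wdtk.storage.datastructures.BitVectorImpl code).
 */
final class GrowableBitVector {
    private long[] words; // 64 bits per array element
    private long size;    // number of addressable bits

    GrowableBitVector(long initialSize) {
        if (initialSize < 0) {
            throw new IllegalArgumentException("Negative size: " + initialSize);
        }
        this.words = new long[wordCount(initialSize)];
        this.size = initialSize;
    }

    /** Number of 64-bit words needed to hold the given number of bits. */
    private static int wordCount(long bits) {
        return (int) ((bits + 63) >>> 6);
    }

    long size() {
        return size;
    }

    /** Assumption: positions at or beyond the current end read as unset. */
    boolean getBit(long position) {
        if (position < 0) {
            throw new IndexOutOfBoundsException("Position " + position + " is out of bounds.");
        }
        if (position >= size) {
            return false;
        }
        return (words[(int) (position >>> 6)] & (1L << (position & 63))) != 0;
    }

    /** Sets a bit, enlarging the vector first if the position is past the end. */
    void setBit(long position, boolean value) {
        if (position < 0) {
            throw new IndexOutOfBoundsException("Position " + position + " is out of bounds.");
        }
        if (position >= size) {
            int needed = wordCount(position + 1);
            if (needed > words.length) {
                // Grow at least geometrically so repeated appends stay cheap.
                words = Arrays.copyOf(words, Math.max(needed, words.length * 2));
            }
            size = position + 1;
        }
        int index = (int) (position >>> 6);
        long mask = 1L << (position & 63);
        if (value) {
            words[index] |= mask;
        } else {
            words[index] &= ~mask;
        }
    }
}
```

With a structure like this, setting position 207239102 on a vector created with a small initial size enlarges the vector instead of failing. Note that java.util.BitSet also grows as needed, but it is indexed by int rather than long.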
Dear Marcus, apologies for the delay in replying. Thank you for the explanation; I will follow your advice about switching to JSON dumps.
Thanks,
Mark