Exception in thread "main" java.lang.NullPointerException
I ran java -cp cermine-impl-1.13-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path data_raw/pdfs on a nested folder of PDFs and got a NullPointerException.
PDF with the issue:
https://www.cloud.luerig.net/index.php/s/CKQRnDePF9aRFwo
My Java version (Windows 10 machine):
java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) Client VM (build 25.251-b08, mixed mode)
Full error message:
Exception in thread "main" java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfImageObject.decodeImageBytes(PdfImageObject.java:298)
at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:199)
at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:168)
at com.itextpdf.text.pdf.parser.ImageRenderInfo.prepareImageObject(ImageRenderInfo.java:150)
at com.itextpdf.text.pdf.parser.ImageRenderInfo.getImage(ImageRenderInfo.java:140)
at pl.edu.icm.cermine.structure.ITextCharacterExtractor$BxDocumentCreator.renderImage(ITextCharacterExtractor.java:366)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ImageXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1311)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
at pl.edu.icm.cermine.InternalContentExtractor.getImages(InternalContentExtractor.java:169)
at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:290)
at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:307)
at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:805)
I just remembered that I filed a similar issue (https://github.com/CeON/CERMINE/issues/36) a few years ago. Back then I asked whether CERMINE has exception handling built in. Is that the case now? Otherwise I would try running it from Python so that failing files can be skipped.
This is a great tool, by the way. We are about to submit our first publication based entirely on results obtained from CERMINE.
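For reference, here is a minimal sketch of the Python workaround I have in mind. It is not tested against CERMINE itself. It assumes that -path expects a folder, so each PDF is copied into its own scratch folder before the command from this report is run on it. It also assumes that an uncaught exception makes the java process exit with a non-zero code, and that the extractor writes its results next to the input PDF. The helper names (cermine_command, process_one, run_all) and the 600-second timeout are my own choices, not part of CERMINE.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Jar and class names are taken from the command in this report.
JAR = "cermine-impl-1.13-jar-with-dependencies.jar"
MAIN_CLASS = "pl.edu.icm.cermine.ContentExtractor"


def cermine_command(folder):
    """Build the same command as in this report, pointed at one folder."""
    return ["java", "-cp", JAR, MAIN_CLASS, "-path", str(folder)]


def process_one(pdf, build_command=cermine_command, timeout=600):
    """Run the extractor on a copy of one PDF; return True if it succeeded."""
    pdf = Path(pdf)
    with tempfile.TemporaryDirectory() as scratch:
        shutil.copy2(pdf, scratch)
        try:
            result = subprocess.run(
                build_command(scratch),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        # Assumption: an uncaught Java exception gives a non-zero exit code.
        if result.returncode != 0:
            return False
        # Copy whatever was produced back next to the original PDF.
        for item in Path(scratch).iterdir():
            if item.name == pdf.name:
                continue
            target = pdf.parent / item.name
            if item.is_dir():
                shutil.copytree(item, target, dirs_exist_ok=True)
            else:
                shutil.copy2(item, target)
    return True


def run_all(root, build_command=cermine_command):
    """Process every PDF under root recursively; return the PDFs that failed."""
    failed = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        if not process_one(pdf, build_command):
            failed.append(pdf)
    return failed
```

Calling run_all("data_raw/pdfs") would then return the list of PDFs that could not be processed, so they can be inspected separately. Starting one JVM per PDF is slow, but it keeps a single bad file from aborting the whole batch.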