jhove
Example file generates "java.lang.NullPointerException" from PDF-hul
Dev Effort
1D
Description
Attaching 0694.pdf
The file validates with veraPDF, but JHOVE fails with a Java exception. We have introduced JHOVE into our production process, and while processing can continue for a file JHOVE considers invalid, it currently stops if JHOVE throws an exception. Rather than adjusting our processing to allow exceptions of this type, I'm wondering whether this is something that can be looked into within JHOVE?
cihm@quark:~$ pdfinfo /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf
Producer: ABBYY Recognition Server
CreationDate: Sat Apr 29 02:25:05 2017
ModDate: Sat Apr 29 02:25:05 2017
Tagged: yes
Form: none
Pages: 1
Encrypted: no
Page size: 670.45 x 1158.9 pts
Page rot: 0
File size: 719683 bytes
Optimized: no
PDF version: 1.4
cihm@quark:~$ /opt/jhove/jhove -k -m PDF-hul -h xml /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf
Jun 02, 2017 2:03:09 PM Jhove main
SEVERE: null
java.lang.NullPointerException
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:292)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:310)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:280)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:346)
at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1065)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:498)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
cihm@quark:~$
russell@russell-desktop:~$ verapdf/verapdf /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf
<?xml version="1.0" encoding="utf-8"?>
<report>
<buildInformation><releaseDetails id="core" version="1.2.2" buildDate="2017-03-01T20:46:00-05:00"></releaseDetails><releaseDetails id="gui" version="1.2.1-PDFBOX" buildDate="2017-03-01T22:00:00-05:00"></releaseDetails><releaseDetails id="pdfbox-validation-model" version="1.2.2" buildDate="2017-03-01T20:56:00-05:00"></releaseDetails></buildInformation>
<jobs>
<job><item size="719683"><name>/opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf</name></item><validationReport profileName="PDF/A-1A validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true"><details passedRules="106" failedRules="0" passedChecks="7794" failedChecks="0"></details></validationReport>
<processingTime>00:00:01:325</processingTime>
</job>
</jobs>
<summary jobs="1" failedJobs="0" valid="1" inValid="0" validExcep="0" features="0"><duration start="1496426672185" finish="1496426673946">00:00:01:761</duration></summary>
</report>
russell@russell-desktop:~$
Clearly a PDF module parser bug. There have been a ton of those. Sorry.
I don't have the resources or Java skills at the moment to be of much help. In the interim, I can run files against the bytestream module if they hit this type of failure when using the matching module.
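A minimal sketch of that interim fallback, in Python, assuming the `-k -m <module> -h xml` invocation shown earlier in this thread and assuming the bytestream module is selected with `-m BYTESTREAM`. The names `characterize`, `module_crashed`, and `run_command`, the default `jhove` path, and the crash markers are hypothetical choices for illustration, not part of JHOVE.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes a command line and returns (exit code, stdout, stderr).
Runner = Callable[[Sequence[str]], Tuple[int, str, str]]

# Strings seen in the transcripts in this thread when the PDF module fails.
CRASH_MARKERS = ("java.lang.NullPointerException", "unhandled exception")


def run_command(cmd: Sequence[str]) -> Tuple[int, str, str]:
    """Run a command and capture its exit code and both output streams."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def module_crashed(returncode: int, stdout: str, stderr: str) -> bool:
    """Guess whether the module failed to parse (assumption: a non-zero
    exit code or one of the markers above signals a crash)."""
    if returncode != 0:
        return True
    return any(m in stdout or m in stderr for m in CRASH_MARKERS)


def characterize(path: str, jhove: str = "jhove",
                 runner: Runner = run_command) -> Tuple[str, str]:
    """Try PDF-hul first, then fall back to BYTESTREAM.

    Returns (module used, report text). The `jhove` default assumes the
    launcher is on PATH; pass e.g. /opt/jhove/jhove otherwise.
    """
    for module in ("PDF-hul", "BYTESTREAM"):
        code, out, err = runner([jhove, "-k", "-m", module, "-h", "xml", path])
        if not module_crashed(code, out, err):
            return module, out
    raise RuntimeError(f"JHOVE failed on {path} with every module tried")
```

The runner is injectable so that the fallback logic can be exercised without a JHOVE installation; in production the default `run_command` would be used.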
Is this something that is going to wait until veraPDF integration to fix, or something that someone might take a look at earlier? This was a random PDF file among many thousands that JHOVE processed without error. (We now generate single-page PDF files from OCR for most images ingested into our TDR, and generate a JHOVE report for each file ingested.)
A change has been committed for the next release which should, in many cases, allow JHOVE to continue processing files after a module has failed to parse a file, such as with the example you provided earlier.
I'm currently using JHOVE 1.22 and seeing what I think is a related problem. While the previous example I posted now runs without warnings, other files still hit the same exception.
Example: oocihm.28876/data/sip/data/files/0102.pdf 0102.pdf
russell@russell-XPS-13-9370:~/Downloads$ pdfinfo 0102.pdf
Producer: ABBYY Recognition Server
CreationDate: Tue Nov 13 12:05:04 2018 EST
ModDate: Tue Nov 13 12:05:04 2018 EST
Tagged: yes
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 522.25 x 699.35 pts
Page rot: 0
File size: 574278 bytes
Optimized: no
PDF version: 1.4
russell@russell-XPS-13-9370:~/Downloads$ ~/verapdf/verapdf 0102.pdf
<?xml version="1.0" encoding="utf-8"?>
<report>
<buildInformation>
<releaseDetails id="core" version="1.14.105" buildDate="2019-10-24T22:54:00-04:00"></releaseDetails>
<releaseDetails id="gui" version="1.14.8" buildDate="2019-10-24T23:11:00-04:00"></releaseDetails>
<releaseDetails id="pdfbox-validation-model" version="1.14.105" buildDate="2019-10-24T23:01:00-04:00"></releaseDetails>
</buildInformation>
<jobs>
<job>
<item size="574278">
<name>/home/russell/Downloads/0102.pdf</name>
</item>
<validationReport profileName="PDF/A-1A validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true">
<details passedRules="107" failedRules="0" passedChecks="7923" failedChecks="0"></details>
</validationReport>
<duration start="1574186027705" finish="1574186028607">00:00:00.902</duration>
</job>
</jobs>
<batchSummary totalJobs="1" failedToParse="0" encrypted="0">
<validationReports compliant="1" nonCompliant="0" failedJobs="0">1</validationReports>
<featureReports failedJobs="0">0</featureReports>
<repairReports failedJobs="0">0</repairReports>
<duration start="1574186027588" finish="1574186028625">00:00:01.037</duration>
</batchSummary>
</report>
russell@russell-XPS-13-9370:~/Downloads$ ~/jhove/jhove -k -m PDF-hul -h xml 0102.pdf
Nov 19, 2019 12:53:54 PM edu.harvard.hul.ois.jhove.JhoveBase process
SEVERE: Validation ended prematurely due to an unhandled exception.
java.lang.NullPointerException
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:280)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:297)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:268)
at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:333)
at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1294)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:811)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:775)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:560)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:432)
at Jhove.main(Jhove.java:281)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schema.openpreservation.org/ois/xml/ns/jhove" xsi:schemaLocation="http://schema.openpreservation.org/ois/xml/ns/jhove https://schema.openpreservation.org/ois/xml/xsd/jhove/1.8/jhove.xsd" name="Jhove" release="1.22.1" date="2019-04-17">
<date>2019-11-19T12:53:54-05:00</date>
<repInfo uri="0102.pdf">
<reportingModule release="1.12.1" date="2019-04-17">PDF-hul</reportingModule>
<lastModified>2019-11-19T12:36:06-05:00</lastModified>
<size>574278</size>
<format>PDF</format>
<status>Unknown</status>
<sigMatch>
<module>PDF-hul</module>
</sigMatch>
<messages>
<message severity="error">Validation ended prematurely due to an unhandled exception.</message>
</messages>
<mimeType>application/pdf</mimeType>
</repInfo>
</jhove>
russell@russell-XPS-13-9370:~/Downloads$
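For pipelines that need to tell this case apart from an ordinary "not valid" result, the XML report above can be inspected directly. The following is a rough sketch in Python, assuming the report uses the namespace and element names shown in the output above (`repInfo`, `status`, `messages/message`). The function names `summarize_report` and `ended_prematurely` are hypothetical, and matching on the message text is an assumption that may not hold across JHOVE releases.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <jhove> root element in the report above.
JHOVE_NS = {"j": "http://schema.openpreservation.org/ois/xml/ns/jhove"}


def summarize_report(xml_text: str) -> list:
    """Return one dict per <repInfo> with its uri, status and error messages."""
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    summaries = []
    for rep in root.findall("j:repInfo", JHOVE_NS):
        errors = [
            (msg.text or "").strip()
            for msg in rep.findall("j:messages/j:message", JHOVE_NS)
            if msg.get("severity") == "error"
        ]
        summaries.append({
            "uri": rep.get("uri"),
            "status": rep.findtext("j:status", default="", namespaces=JHOVE_NS),
            "errors": errors,
        })
    return summaries


def ended_prematurely(summary: dict) -> bool:
    """True when the module gave up rather than reaching a verdict
    (assumption: status "Unknown" plus the message seen in this thread)."""
    return summary["status"] == "Unknown" and any(
        "unhandled exception" in err for err in summary["errors"]
    )
```

A caller could route files for which `ended_prematurely` is true to a separate queue instead of treating them as invalid PDFs.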
Not sure whether this is helpful, but I have 19 example files that cause this exception. I've tested them and get the same results on the current release of JHOVE, which has ReportingModule: PDF-hul, Rel. 1.12.2 (2019-12-10).