
Example file generates "java.lang.NullPointerException" from PDF-hul

Open RussellMcOrmond opened this issue 7 years ago • 6 comments

Dev Effort

1D

Description

Attaching 0694.pdf

File validates with veraPDF, but JHOVE fails with a Java exception. We have introduced JHOVE into our production process. Processing can go forward for a file JHOVE considers invalid, but it currently stops if JHOVE throws an exception. Rather than adjusting our processing to allow exceptions of this type, I'm wondering if this is something that can be looked into within JHOVE?

cihm@quark:~$ pdfinfo /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf 
Producer:       ABBYY Recognition Server
CreationDate:   Sat Apr 29 02:25:05 2017
ModDate:        Sat Apr 29 02:25:05 2017
Tagged:         yes
Form:           none
Pages:          1
Encrypted:      no
Page size:      670.45 x 1158.9 pts
Page rot:       0
File size:      719683 bytes
Optimized:      no
PDF version:    1.4
cihm@quark:~$ /opt/jhove/jhove -k -m PDF-hul -h xml /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf
Jun 02, 2017 2:03:09 PM Jhove main
SEVERE: null
java.lang.NullPointerException
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:292)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:310)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:280)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:346)
	at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1065)
	at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:498)
	at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
	at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
	at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
	at Jhove.main(Jhove.java:292)
cihm@quark:~$ 
russell@russell-desktop:~$ verapdf/verapdf /opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf
<?xml version="1.0" encoding="utf-8"?>
  <report>
    <buildInformation><releaseDetails id="core" version="1.2.2" buildDate="2017-03-01T20:46:00-05:00"></releaseDetails><releaseDetails id="gui" version="1.2.1-PDFBOX" buildDate="2017-03-01T22:00:00-05:00"></releaseDetails><releaseDetails id="pdfbox-validation-model" version="1.2.2" buildDate="2017-03-01T20:56:00-05:00"></releaseDetails></buildInformation>
  
    <jobs>
      <job><item size="719683"><name>/opt/wip/Rejected/heritage_ocr/lac_reel_t3953/0694.pdf</name></item><validationReport profileName="PDF/A-1A validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true"><details passedRules="106" failedRules="0" passedChecks="7794" failedChecks="0"></details></validationReport>
        <processingTime>00:00:01:325</processingTime>
      </job>
    </jobs>
  <summary jobs="1" failedJobs="0" valid="1" inValid="0" validExcep="0" features="0"><duration start="1496426672185" finish="1496426673946">00:00:01:761</duration></summary>
</report>
russell@russell-desktop:~$ 

RussellMcOrmond avatar Jun 02 '17 18:06 RussellMcOrmond

Clearly a PDF module parser bug. There have been a ton of those. Sorry.

gmcgath avatar Jun 02 '17 18:06 gmcgath

I don't have the resources or Java skills at the moment to be of much help. In the interim I can run files against the bytestream module if they have this type of failure when using the matching module.

Is this something that is going to wait until veraPDF integration to fix, or something that someone might take a look at earlier? This was a random PDF file among many thousands that JHOVE processed without error (We now generate single-page PDF files from OCR for most images ingested into our TDR, and generate a JHOVE report for each file ingested).

RussellMcOrmond avatar Jun 06 '17 18:06 RussellMcOrmond
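
The interim workaround described above (retry with the bytestream module when the matching module throws) could be sketched roughly as follows. This is a minimal illustration, not part of JHOVE: the helper names `run_cli` and `jhove_with_fallback` are hypothetical, the module name `BYTESTREAM` is assumed to be what the installed JHOVE calls its bytestream module, and the crash check relies only on the text seen in the transcripts in this thread (a Java exception on stderr, or the "unhandled exception" message in the report), not on any documented exit code.

```python
import subprocess

def run_cli(argv):
    """Run a command and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr

def jhove_with_fallback(path, jhove="/opt/jhove/jhove", runner=run_cli):
    """Try PDF-hul first; if the run shows a Java exception, retry with BYTESTREAM.

    `runner` is injectable so the logic can be exercised without JHOVE installed.
    Returns (module_used, report_text).
    """
    for module in ("PDF-hul", "BYTESTREAM"):  # BYTESTREAM name is an assumption
        _code, out, err = runner([jhove, "-k", "-m", module, "-h", "xml", path])
        # Heuristic based on the output shown in this thread, not on a JHOVE contract.
        crashed = "Exception" in err or "unhandled exception" in out
        if not crashed:
            return module, out
    raise RuntimeError(f"JHOVE failed on {path} with every module tried")
```

A caller would record which module produced the report, so that files characterized only as a bytestream can be revisited once the parser bug is fixed.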

A change has been committed for the next release which should, in many cases, allow JHOVE to continue processing files after a module has failed to parse a file, such as with the example you provided earlier.

david-russo avatar Oct 03 '17 09:10 david-russo

I'm currently using JHOVE 1.22 and seeing what I think is a related problem. While the previous example I posted now runs without warnings, other files still produce the same exception.

Example: oocihm.28876/data/sip/data/files/0102.pdf 0102.pdf

russell@russell-XPS-13-9370:~/Downloads$ pdfinfo 0102.pdf 
Producer:       ABBYY Recognition Server
CreationDate:   Tue Nov 13 12:05:04 2018 EST
ModDate:        Tue Nov 13 12:05:04 2018 EST
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      522.25 x 699.35 pts
Page rot:       0
File size:      574278 bytes
Optimized:      no
PDF version:    1.4
russell@russell-XPS-13-9370:~/Downloads$ ~/verapdf/verapdf 0102.pdf 
<?xml version="1.0" encoding="utf-8"?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.14.105" buildDate="2019-10-24T22:54:00-04:00"></releaseDetails>
    <releaseDetails id="gui" version="1.14.8" buildDate="2019-10-24T23:11:00-04:00"></releaseDetails>
    <releaseDetails id="pdfbox-validation-model" version="1.14.105" buildDate="2019-10-24T23:01:00-04:00"></releaseDetails>
  </buildInformation>
  <jobs>
    <job>
      <item size="574278">
        <name>/home/russell/Downloads/0102.pdf</name>
      </item>
      <validationReport profileName="PDF/A-1A validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true">
        <details passedRules="107" failedRules="0" passedChecks="7923" failedChecks="0"></details>
      </validationReport>
      <duration start="1574186027705" finish="1574186028607">00:00:00.902</duration>
    </job>
  </jobs>
  <batchSummary totalJobs="1" failedToParse="0" encrypted="0">
    <validationReports compliant="1" nonCompliant="0" failedJobs="0">1</validationReports>
    <featureReports failedJobs="0">0</featureReports>
    <repairReports failedJobs="0">0</repairReports>
    <duration start="1574186027588" finish="1574186028625">00:00:01.037</duration>
  </batchSummary>
</report>
russell@russell-XPS-13-9370:~/Downloads$ ~/jhove/jhove -k -m PDF-hul -h xml 0102.pdf 
Nov 19, 2019 12:53:54 PM edu.harvard.hul.ois.jhove.JhoveBase process
SEVERE: Validation ended prematurely due to an unhandled exception.
java.lang.NullPointerException
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:280)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readArray(Parser.java:297)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:268)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:333)
	at edu.harvard.hul.ois.jhove.module.PdfModule.parseTrailer(PdfModule.java:1294)
	at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:811)
	at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:775)
	at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:560)
	at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:432)
	at Jhove.main(Jhove.java:281)

<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schema.openpreservation.org/ois/xml/ns/jhove" xsi:schemaLocation="http://schema.openpreservation.org/ois/xml/ns/jhove https://schema.openpreservation.org/ois/xml/xsd/jhove/1.8/jhove.xsd" name="Jhove" release="1.22.1" date="2019-04-17">
 <date>2019-11-19T12:53:54-05:00</date>
 <repInfo uri="0102.pdf">
  <reportingModule release="1.12.1" date="2019-04-17">PDF-hul</reportingModule>
  <lastModified>2019-11-19T12:36:06-05:00</lastModified>
  <size>574278</size>
  <format>PDF</format>
  <status>Unknown</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message severity="error">Validation ended prematurely due to an unhandled exception.</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
russell@russell-XPS-13-9370:~/Downloads$ 

RussellMcOrmond avatar Nov 19 '19 17:11 RussellMcOrmond
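
Since JHOVE 1.22 still emits an XML report when the module crashes, a pipeline could detect this case from the report itself instead of watching for a stack trace. The sketch below assumes the report shape shown in the output above (a `repInfo` with `status` and `messages/message` elements in the `http://schema.openpreservation.org/ois/xml/ns/jhove` namespace). The function names are hypothetical, and matching on the message text is an assumption that may not hold across JHOVE releases.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the report shown in this thread.
JHOVE_NS = {"j": "http://schema.openpreservation.org/ois/xml/ns/jhove"}

def summarize_report(xml_text):
    """Return (status, error_messages) for the first repInfo in a JHOVE XML report."""
    root = ET.fromstring(xml_text)
    rep = root.find("j:repInfo", JHOVE_NS)
    if rep is None:
        return None, []
    status = rep.findtext("j:status", default=None, namespaces=JHOVE_NS)
    errors = [
        m.text or ""
        for m in rep.findall("j:messages/j:message", JHOVE_NS)
        if m.get("severity") == "error"
    ]
    return status, errors

def ended_prematurely(xml_text):
    """True when the report looks like the crash case: Unknown status plus the exception message."""
    status, errors = summarize_report(xml_text)
    return status == "Unknown" and any("unhandled exception" in e for e in errors)
```

Files flagged this way could be routed to a retry or a manual-review queue rather than halting the whole ingest.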

Not sure if it is helpful, but I have 19 example files that cause this exception. I've tested them and get the same results on the current release of JHOVE, which has ReportingModule: PDF-hul, Rel. 1.12.2 (2019-12-10)

RussellMcOrmond avatar Jul 27 '20 23:07 RussellMcOrmond