jhove
PDF-hul: ArrayIndexOutOfBoundsException
Dev Effort
1D - investigation
Description
A large number of our PDFs trigger a Java exception when JHOVE attempts to validate them. An example can be downloaded from: http://gac.canadiana.ca/view/ooe.b4222507_008 (the Download PDF button is beside the image resize - + buttons).
russell@russell-desktop2:~/Downloads$ pdfinfo ooe.b4222507_008-document.pdf
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 16
Encrypted: no
Page size: 635.3 x 815.05 pts
Page rot: 0
File size: 10342062 bytes
Optimized: no
PDF version: 1.4
russell@russell-desktop2:~/Downloads$ identify ooe.b4222507_008-document.pdf
ooe.b4222507_008-document.pdf[0] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[1] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[2] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[3] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[4] PBM 635x815 635x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[5] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[6] PBM 645x815 645x815+0+0 16-bit Bilevel Gray 65.3KB 0.010u 0:00.009
ooe.b4222507_008-document.pdf[7] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.009
ooe.b4222507_008-document.pdf[8] PBM 633x822 633x822+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.009
ooe.b4222507_008-document.pdf[9] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[10] PBM 607x813 607x813+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[11] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[12] PBM 607x813 607x813+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[13] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[14] PBM 607x815 607x815+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
ooe.b4222507_008-document.pdf[15] PBM 626x819 626x819+0+0 16-bit Bilevel Gray 65.3KB 0.000u 0:00.000
russell@russell-desktop2:~/Downloads$ /opt/jhove/jhove ooe.b4222507_008-document.pdf
java.lang.ArrayIndexOutOfBoundsException: 710
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:605)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
Jhove (Rel. 1.16.6, 2017-04-27)
Date: 2017-05-01 11:47:40 EDT
RepresentationInformation: ooe.b4222507_008-document.pdf
ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
LastModified: 2017-05-01 11:40:33 EDT
Size: 10342062
Format: bytestream
Status: Well-Formed and valid
SignatureMatches:
PDF-hul
WARC-kb
MIMEtype: application/octet-stream
russell@russell-desktop2:~/Downloads$ /opt/jhove/jhove -m PDF-hul ooe.b4222507_008-document.pdf
java.lang.ArrayIndexOutOfBoundsException: 710
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
Jhove (Rel. 1.16.6, 2017-04-27)
Date: 2017-05-01 11:48:01 EDT
RepresentationInformation: ooe.b4222507_008-document.pdf
ReportingModule: PDF-hul, Rel. 1.8 (2017-03-14)
LastModified: 2017-05-01 11:40:33 EDT
Size: 10342062
Format: PDF
Status: Not well-formed
SignatureMatches:
PDF-hul
ErrorMessage: 585
Offset: 10339461
ErrorMessage: No document catalog dictionary
Offset: 0
MIMEtype: application/pdf
russell@russell-desktop2:~/Downloads$
Note: pdfinfo is from poppler-utils, and identify is from ImageMagick. identify renders every page of the PDF to an image, which serves as a check that the file is usable. The affected PDF files render in every PDF viewer we have tested.
This issue was also discussed on the JHOVE mailing list. A couple of thousand PDF files in our repository produce a similar report and may be affected by the same issue.
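To check which of those files actually fail the same way, the text report can be reduced to a few fields and compared across files. The following is only a rough sketch: it assumes the report layout shown in the session above, and the function names (run_jhove, summarize_jhove_report) are made up for illustration and are not part of JHOVE.

```python
import re
import subprocess


def run_jhove(pdf_path, jhove="/opt/jhove/jhove"):
    """Run JHOVE with the PDF-hul module and return stdout and stderr together.

    The install path and the -m PDF-hul option are taken from the session above.
    """
    proc = subprocess.run(
        [jhove, "-m", "PDF-hul", pdf_path],
        capture_output=True,
        text=True,
    )
    return proc.stderr + proc.stdout


def summarize_jhove_report(text):
    """Pull the exception line, Status and ErrorMessage values out of one report."""
    summary = {"exception": None, "status": None, "errors": []}
    for raw in text.splitlines():
        line = raw.strip()
        # An uncaught Java exception is printed as e.g. "java.lang.FooException: 710".
        if summary["exception"] is None and re.match(
            r"java\.[\w.]+(Exception|Error)\b", line
        ):
            summary["exception"] = line
        elif line.startswith("Status:"):
            summary["status"] = line.split(":", 1)[1].strip()
        elif line.startswith("ErrorMessage:"):
            summary["errors"].append(line.split(":", 1)[1].strip())
    return summary
```

Calling summarize_jhove_report(run_jhove(path)) for each file and grouping by the exception line would show whether the other files hit the same ArrayIndexOutOfBoundsException or merely share the "Not well-formed" status.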
If it turns out the problem is with the PDF file and not JHOVE, can someone with more knowledge of the PDF file format document how it is broken, so that a report can be sent to https://poppler.freedesktop.org/ (and possibly other projects; I haven't checked which tools generated all the PDF files that JHOVE is flagging)?
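As a starting point for that analysis, one thing that could be checked is the trailer at the end of the file. This is a guess based only on the stack trace: the exception is raised in getObject while resolving an indirect object for the document catalog, so the number in the exception may be an object number that falls outside the cross-reference table. In a PDF trailer, /Size gives the number of cross-reference entries and /Root points to the document catalog. The sketch below reads those values from the last bytes of a file; the helper names are hypothetical, and it only handles a plain-text trailer, not every layout the PDF format allows.

```python
import re


def read_tail(path, length=4096):
    """Return the last `length` bytes of a file (or the whole file if shorter)."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to end of file to learn its size
        size = f.tell()
        f.seek(max(0, size - length))
        return f.read()


def read_trailer_hints(tail):
    """Extract the last startxref offset, /Size and /Root found in `tail`."""
    hints = {"startxref": None, "size": None, "root": None}
    found = list(re.finditer(rb"startxref\s+(\d+)", tail))
    if found:
        hints["startxref"] = int(found[-1].group(1))
    found = list(re.finditer(rb"/Size\s+(\d+)", tail))
    if found:
        hints["size"] = int(found[-1].group(1))
    found = list(re.finditer(rb"/Root\s+(\d+)\s+(\d+)\s+R", tail))
    if found:
        # (object number, generation number) of the document catalog
        hints["root"] = (int(found[-1].group(1)), int(found[-1].group(2)))
    return hints


def root_outside_xref(hints):
    """True if the catalog's object number is not covered by /Size."""
    if hints["size"] is None or hints["root"] is None:
        return False
    return hints["root"][0] >= hints["size"]
```

If root_outside_xref(read_trailer_hints(read_tail(path))) returns True for the affected files, that would suggest the trailer declares fewer objects than the file references, which is something a report to the generating tool could describe concretely. If it returns False, the cause lies elsewhere.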