jhove

validation process with specific pdf stuck in infinite loop

Open digitalpreservationhero opened this issue 5 years ago • 8 comments

The validation process for the attached PDF ran for over 12 hours using JHOVE 1.22 pdf-hul before I aborted it. I assume there's something in the PDF structure that causes JHOVE to get stuck in an infinite loop. infinite_loop.pdf

digitalpreservationhero, Sep 04 '19 08:09

Hi @digitalpreservationhero, thanks for raising this. I've just tried this on the new v1.24 release candidate and it no longer runs in an infinite loop, or at least not for me. It now fails with a StackOverflowError and the offending (recursive) method appears to be edu.harvard.hul.ois.jhove.module.pdf.AProfile.checkOutlineItem(AProfile.java:864). This isn't going to be fixed for the upcoming release but I'll prioritise it for the next release.

carlwilson, Dec 10 '19 23:12

I was interested to see if there was any relation to https://github.com/openpreserve/jhove/issues/115

In terms of logic, I'd need to look into this in more detail to know if there is a relation. In terms of behavior, it looks like the JHOVE 1.25 integration branch (like 1.24) now exits with a stack overflow and stack trace, and no longer enters an infinite loop:

Exception in thread "main" java.lang.StackOverflowError
	at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:304)
	at java.lang.StringCoding.encode(StringCoding.java:344)
	at java.lang.StringCoding.encode(StringCoding.java:387)
	at java.lang.String.getBytes(String.java:958)
	at edu.harvard.hul.ois.jhove.module.pdf.Name.isPdfACompliant(Name.java:23)
	at edu.harvard.hul.ois.jhove.module.pdf.Tokenizer.getNext(Tokenizer.java:424)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Parser.java:94)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.getNext(Parser.java:83)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:266)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readDictionary(Parser.java:333)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObject(Parser.java:271)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Parser.java:222)
	at edu.harvard.hul.ois.jhove.module.pdf.Parser.readObjectDef(Parser.java:198)
	at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2722)
	at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2692)
	at edu.harvard.hul.ois.jhove.module.pdf.AProfile.checkOutlineItem(AProfile.java:856)
	at edu.harvard.hul.ois.jhove.module.pdf.AProfile.checkOutlineItem(AProfile.java:864)
	at edu.harvard.hul.ois.jhove.module.pdf.AProfile.checkOutlineItem(AProfile.java:864)
	at edu.harvard.hul.ois.jhove.module.pdf.AProfile.checkOutlineItem(AProfile.java:864)
        ... repeats 1009 times ...

Out of interest I ran Tika on this file to get a bit more info on the file. That's here: tika-infinite-loop-output.txt

ross-spencer, May 06 '20 12:05

I have seen similar behaviour (seemingly infinite spinning whilst analysing) on the attached PDF. fulltext.pdf

jackdos, May 07 '20 16:05

@jackdos thanks for the link. I was in the module so I took a look, and it seems like your file is getting hung up in around the same area. I'm not up on my Java, and the pattern for the loop/iterator isn't one I'm used to, but that might be okay. The callers trying to return "next", though, seem never to return null, but always a reference to an object that doesn't evaluate to null. Unlike the original author's file, yours doesn't produce a stack overflow for me, so it may be a slightly different issue. I couldn't quite figure out the change that was needed, but hopefully this helps folks a little.

https://github.com/openpreserve/jhove/blob/8677ad043a59d93b0dbe949047ef064bc592bb08/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/AProfile.java#L819-L847

ross-spencer, May 08 '20 03:05

I just had a look at that block of code for the file I attached. I don't see anything wrong with the pattern itself.

It seems to get itself into an infinite loop because at object 116 the item returned by calling "get("Next")" seems to be the item itself (at least, it has the same _objNumber). It looks like the readDictionary method in Parser isn't handling object 116 very well. From what I can tell it reads through the first set of things correctly, gets to the Title token, then the next token it reads over-runs the end of the object, and the beginning of the next object, such that the next token it reads is actually the "Next" for object 117 (which is object 116, hence the endless loop).

This is around line 5482 in the PDF (in a text editor). I've had a look and I can't really see what is wrong with that object. To me, it looks basically identical to object 115, but I might be missing some subtle difference that is making the parser skip straight through the endobj marker.

Hopefully someone who understands PDF dictionaries will be able to point out what's wrong, and we can work out how to change the parser to detect the error.
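Independently of the parser fix, the self-referencing "Next" described above could be caught by remembering which object numbers have already been visited. The following is only a minimal sketch of that idea, not JHOVE's code: the class `OutlineCycleCheck`, the method `findRepeatedItem`, and the plain map standing in for resolved outline dictionaries are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an outline walk; not part of JHOVE.
class OutlineCycleCheck {

    /**
     * Follows "Next" links iteratively, starting at firstObjNumber.
     * nextLinks maps an object number to the object number of its "Next"
     * entry; a missing key means the sibling chain ends there.
     * Returns the object number that was reached twice, or -1 if the
     * chain terminates normally.
     */
    static int findRepeatedItem(Map<Integer, Integer> nextLinks, int firstObjNumber) {
        Set<Integer> visited = new HashSet<>();
        Integer current = firstObjNumber;
        while (current != null) {
            // Set.add returns false when the object number was already seen.
            if (!visited.add(current)) {
                return current;
            }
            current = nextLinks.get(current);
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> links = new HashMap<>();
        links.put(115, 116);
        links.put(116, 116); // as observed: "Next" of 116 resolves to 116 itself
        System.out.println(findRepeatedItem(links, 115)); // prints 116
    }
}
```

Because the walk is a loop rather than recursion, a cyclic chain produces neither an endless spin nor a StackOverflowError; the caller can report the repeated object number as a validation error instead.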

jackdos, May 09 '20 15:05

Also wondering, given the stack trace and description, whether #306 is related.

jackdos, May 11 '20 08:05

Nice. Looking through the Title property, I can see that it's encoded as UTF-16, and includes escaped parentheses and at least one escaped newline, so the parser is probably over-running the bounds of the Title due to issue #277, after which all bets are off as to what it thinks it's reading.
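To illustrate why escaped parentheses matter for finding the end of a literal string, here is a rough sketch of a byte-level scanner. It is not JHOVE's Tokenizer and makes no claim about what #277 actually changes; the class `LiteralStringScanner` and method `findEnd` are hypothetical. It relies only on the PDF rules that a backslash escapes the byte that follows it and that unescaped parentheses inside a literal string must be balanced.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of JHOVE.
class LiteralStringScanner {

    /**
     * data[start] must be the opening '(' of a PDF literal string.
     * Returns the index of the matching ')' or -1 if the string is
     * unterminated. Works on raw bytes, so it applies before any
     * UTF-16 decoding of the string's content.
     */
    static int findEnd(byte[] data, int start) {
        if (start >= data.length || data[start] != '(') {
            throw new IllegalArgumentException("no literal string at index " + start);
        }
        int depth = 0;
        for (int i = start; i < data.length; i++) {
            byte b = data[i];
            if (b == '\\') {
                i++; // skip the escaped byte, e.g. the '(' in "\("
            } else if (b == '(') {
                depth++;
            } else if (b == ')') {
                depth--;
                if (depth == 0) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The bytes are: (a\(b) /Next
        byte[] data = "(a\\(b) /Next".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findEnd(data, 0)); // prints 5
    }
}
```

A scanner that counted the escaped "(" as an opening parenthesis would never reach depth zero at index 5 and would keep reading past the end of the Title, which matches the over-run behaviour described above.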

david-russo, May 11 '20 11:05