CERMINE TrueViz extraction fails silently for some PDFs

Open afs25 opened this issue 5 years ago • 0 comments

First of all, thank you for developing CERMINE. I am very impressed by what it can do.

One of the projects I am working on at the moment relies on identifying some elements of the layout of PDF files, so I am particularly interested in parsing the TrueViz XML output of CERMINE. I noticed that for some PDFs, CERMINE silently fails to output the content in TrueViz format. The resulting .cermstr file does not contain any Zone, Word or Character elements inside each of the Page elements.
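To make the symptom easy to reproduce, here is a minimal sketch of how affected files could be flagged. It only assumes the element names mentioned above (Page, Zone, Word, Character); the function name `empty_pages` and the sample XML are made up for illustration and are not real CERMINE output.

```python
import xml.etree.ElementTree as ET

# Element names taken from the description above; everything else is illustrative.
CONTENT_TAGS = {"Zone", "Word", "Character"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}Page' down to 'Page'."""
    return tag.rsplit("}", 1)[-1]


def empty_pages(xml_text):
    """Return the 0-based indices of Page elements with no Zone/Word/Character descendants."""
    root = ET.fromstring(xml_text)
    pages = [el for el in root.iter() if local_name(el.tag) == "Page"]
    empty = []
    for index, page in enumerate(pages):
        has_content = any(local_name(el.tag) in CONTENT_TAGS for el in page.iter())
        if not has_content:
            empty.append(index)
    return empty


# Hypothetical sample: the first page is empty, the second has content.
SAMPLE = """<Document>
  <Page></Page>
  <Page><Zone><Word><Character/></Word></Zone></Page>
</Document>"""

if __name__ == "__main__":
    print(empty_pages(SAMPLE))
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `empty_pages`; a result listing every page index would match the failure described here.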

Unfortunately I cannot post the problematic PDF here because it is copyrighted (I am happy to send the PDF in a personal message if requested), but I will post an example as soon as I come across one that can be shared.

Is there any way I can inspect debug information from CERMINE to try to understand what is special about this PDF and how I can go about fixing this? In other words, can the verbosity of CERMINE be increased somehow? Perhaps pre-processing the PDF with pdftk or ghostscript might solve the problem, but it is difficult to implement that without understanding the underlying problem.
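For the Ghostscript idea, this is roughly what I have in mind: a sketch that rewrites the PDF through the `pdfwrite` device before handing it to CERMINE. It assumes the `gs` executable is on the PATH, the helper names are hypothetical, and I have not confirmed that rewriting the file actually resolves the empty output.

```python
import subprocess


def build_gs_command(src, dst):
    """Build a Ghostscript command that rewrites src into a fresh PDF at dst."""
    # -o sets the output file and implies batch, no-pause operation;
    # -sDEVICE=pdfwrite selects Ghostscript's PDF output device.
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def rewrite_pdf(src, dst):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_command(src, dst), check=True)


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(build_gs_command("input.pdf", "rewritten.pdf")))
```

The rewritten file would then be passed to CERMINE in place of the original to see whether the Page elements get populated.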

Thank you in advance for any help!

Aug 20 '19 13:08 afs25
