pdfminer.six

No /Root object! - Is this really a PDF?

Open micmalti opened this issue 4 years ago • 5 comments

Hi,

I've got this PDF (see attachment) which opens just fine in a PDF viewer but fails to get parsed:

PDFSyntaxError                            Traceback (most recent call last)

<ipython-input-21-661fe9476e35> in <module>()
      7 device = TextConverter(rsrcmgr, outfp, codec="utf-8", laparams=LAParams())
      8 interpreter = PDFPageInterpreter(rsrcmgr, device)
----> 9 for page in PDFPage.get_pages(fp, pagenos=set(), caching=True, check_extractable=True):
     10     interpreter.process_page(page)
     11 device.close()

1 frames

/usr/local/lib/python3.6/dist-packages/pdfminer/pdfpage.py in get_pages(cls, fp, pagenos, maxpages, password, caching, check_extractable)
    126         parser = PDFParser(fp)
    127         # Create a PDF document object that stores the document structure.
--> 128         doc = PDFDocument(parser, password=password, caching=caching)
    129         # Check if the document allows text extraction.
    130         # If not, warn the user and proceed.

/usr/local/lib/python3.6/dist-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
    594                 break
    595         else:
--> 596             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    597         if self.catalog.get('Type') is not LITERAL_CATALOG:
    598             if settings.STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Steps to reproduce the bug:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
f = PDFParser(open(pdf, 'rb'))
doc = PDFDocument(f)

W020160201380721221093.pdf

micmalti avatar Aug 23 '20 00:08 micmalti

@micmalti I was able to resolve this issue by repairing the PDF via Ghostscript. Command I ran:

gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf

The repaired PDF.

Is this something that pdfminer should be able to handle natively? I don't know.

samkit-jain avatar Sep 16 '20 12:09 samkit-jain

I've labelled this as an anomaly, i.e. a PDF that cannot be parsed because it deviates from the PDF reference specification. These are currently not a priority for pdfminer.six, but they could be in the future.

In general, these problems are fixed by using Ghostscript or mutool. This suggests that pdfminer.six could do the same.

pietermarsman avatar Sep 17 '20 18:09 pietermarsman

I get the same error with this PDF file: https://www.ema.europa.eu/documents/product-information/rapamune-epar-product-information_en.pdf

speleo3 avatar Jun 08 '22 15:06 speleo3

On my end, it seems these errors have started to appear more frequently. Would be great to have tools to detect and handle this on the fly.
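
A minimal sketch of what "detect and handle on the fly" could look like in user code (this is not something pdfminer.six provides): try to parse the file, and if that fails, repair it into a temporary file and retry once. The names `open_with_repair`, `parse` and `repair` are hypothetical hooks made up for this example.

```python
import os
import tempfile
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def open_with_repair(
    path: str,
    parse: Callable[[str], T],
    repair: Callable[[str, str], None],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Parse `path`; on failure, repair into a temp file and retry once.

    `parse(path)` and `repair(src, dst)` are supplied by the caller
    (hypothetical hooks, not part of pdfminer.six). `parse` should return
    fully extracted data, because the temp file is deleted afterwards.
    """
    try:
        return parse(path)
    except errors:
        # Reserve a temp path for the repaired copy.
        fd, repaired = tempfile.mkstemp(suffix=".pdf")
        os.close(fd)
        try:
            repair(path, repaired)
            # A second failure propagates to the caller.
            return parse(repaired)
        finally:
            os.remove(repaired)
```

With pdfminer.six this might be called as `open_with_repair(path, extract_text, repair, errors=(PDFSyntaxError,))`, assuming `extract_text` from `pdfminer.high_level`, `PDFSyntaxError` from `pdfminer.pdfparser`, and a `repair` function that shells out to an external tool of your choice.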

rain01 avatar Jun 20 '22 17:06 rain01

@samkit-jain Thank you for the workaround - that helped me tremendously.

FelixSchwarz avatar Aug 23 '22 14:08 FelixSchwarz

I was able to resolve this issue by repairing the PDF via Ghostscript.

Update: Months later I discovered some strange issues with "repaired" PDFs. For example, the word "Reflexion" was displayed just fine, but Acrobat Reader was unable to find the exact text when searching for it. Typing "Renexion" did find the word. When I selected the word in the PDF viewer and copied it into a text editor, I got "Renexion". (Just to be clear: this is not an Acrobat problem. pdfminer extracted the same "bad" word, as did the PDF readers in Firefox, Chrome and Edge.)

A similar error happened in other files. Each of the affected words contained an "f", and if a word was affected, it was affected throughout the whole document, but not every word with an "f" was affected. So you should probably be a bit cautious about Ghostscript's "pdfwrite" device (I used Ghostscript 9.56.1 on Fedora).

mutool clean worked, though (mupdf 1.20.3 on Fedora).
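
For anyone scripting this, here is a rough sketch that prefers mutool and only falls back to Ghostscript. The function names are made up for this example. The Ghostscript arguments are the ones quoted earlier in this thread; the `mutool clean input.pdf output.pdf` form is my assumption based on MuPDF's usual command-line usage, since only the subcommand is named above.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def repair_command(src: str, dst: str, tool: str) -> List[str]:
    """Return the argv for repairing `src` into `dst` with the given tool."""
    if tool == "mutool":
        # Equivalent to: mutool clean input.pdf output.pdf
        return ["mutool", "clean", src, dst]
    if tool == "gs":
        # Equivalent to: gs -o output.pdf -sDEVICE=pdfwrite input.pdf
        return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]
    raise ValueError(f"unknown repair tool: {tool!r}")


def pick_tool(preferred: Sequence[str] = ("mutool", "gs")) -> Optional[str]:
    """Return the first tool from `preferred` found on PATH, else None."""
    for tool in preferred:
        if shutil.which(tool):
            return tool
    return None


def repair_pdf(src: str, dst: str) -> None:
    """Repair `src` into `dst` with whichever tool is installed."""
    tool = pick_tool()
    if tool is None:
        raise RuntimeError("neither mutool nor gs found on PATH")
    # check=True raises CalledProcessError if the tool exits non-zero.
    subprocess.run(repair_command(src, dst, tool), check=True)
```

Given the "Renexion" problem above, it is probably worth spot-checking the extracted text of a repaired file against the original rendering before trusting either tool in bulk.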

FelixSchwarz avatar Oct 21 '22 15:10 FelixSchwarz

This problem happens when you highlight words or draw in the PDF with Edge.

Eririf avatar Oct 30 '23 00:10 Eririf

That's because Edge uses Adobe nowadays: https://blogs.windows.com/msedgedev/2023/02/08/adobe-acrobat-microsoft-edge-pdf/

petervwyatt avatar Oct 30 '23 05:10 petervwyatt

Another way to fix the PDF is in code, for example in Python:

import PyPDF2

# `filename` is the path to the PDF that fails to parse
# Open the existing PDF
with open(filename, "rb") as file:
    reader = PyPDF2.PdfReader(file)

    # Create a new PDF
    repaired_filename = f"{filename.replace('.pdf', '')}_repaired.pdf"
    with open(repaired_filename, "wb") as new_file:
        writer = PyPDF2.PdfWriter()

        # Copy the pages from the old file to the new one
        for page in reader.pages:
            writer.add_page(page)

        writer.write(new_file)

varun-ml avatar Dec 05 '23 14:12 varun-ml

Closing this because the issue can be circumvented by repairing the PDF.

pietermarsman avatar Dec 22 '23 20:12 pietermarsman