pdfminer.six
No /Root object! - Is this really a PDF?
Hi,
I've got this PDF (see attachment) which opens just fine in a PDF viewer but fails to get parsed:
PDFSyntaxError Traceback (most recent call last)
<ipython-input-21-661fe9476e35> in <module>()
7 device = TextConverter(rsrcmgr, outfp, codec="utf-8", laparams=LAParams())
8 interpreter = PDFPageInterpreter(rsrcmgr, device)
----> 9 for page in PDFPage.get_pages(fp, pagenos=set(), caching=True, check_extractable=True):
10 interpreter.process_page(page)
11 device.close()
1 frames
/usr/local/lib/python3.6/dist-packages/pdfminer/pdfpage.py in get_pages(cls, fp, pagenos, maxpages, password, caching, check_extractable)
126 parser = PDFParser(fp)
127 # Create a PDF document object that stores the document structure.
--> 128 doc = PDFDocument(parser, password=password, caching=caching)
129 # Check if the document allows text extraction.
130 # If not, warn the user and proceed.
/usr/local/lib/python3.6/dist-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
594 break
595 else:
--> 596 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
597 if self.catalog.get('Type') is not LITERAL_CATALOG:
598 if settings.STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Steps to reproduce the bug:
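from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
parser = PDFParser(open(pdf, 'rb'))  # pdf is the path to the attached file
doc = PDFDocument(parser)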
@micmalti I was able to resolve this issue by repairing the PDF via Ghostscript. Command I ran:
gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf
Is this something that pdfminer should be able to handle natively? I don't know.
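For anyone who wants to script this repair step, here is a minimal Python sketch that wraps the exact Ghostscript command quoted above. The function names `build_gs_command` and `repair_with_ghostscript` are made up for illustration, and the sketch assumes a `gs` binary is on the PATH. It is not something pdfminer.six provides.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(src: str, dst: str) -> List[str]:
    """Return the Ghostscript invocation from this thread as an argument list."""
    # Same flags as the command above: gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_with_ghostscript(src: str, dst: str) -> Path:
    """Rewrite src into dst with Ghostscript (hypothetical helper).

    Raises FileNotFoundError if gs is not installed and
    subprocess.CalledProcessError if Ghostscript exits with an error.
    """
    subprocess.run(build_gs_command(src, dst), check=True)
    return Path(dst)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.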
I've labelled this as an anomaly, i.e. a PDF that cannot be parsed because it deviates from the PDF reference specification. Anomalies are currently not a priority for pdfminer.six, but they could be in the future.
In general, these problems are fixed by using ghostscript or mutools. This suggests that pdfminer.six could do the same.
I get the same error with this PDF file: https://www.ema.europa.eu/documents/product-information/rapamune-epar-product-information_en.pdf
On my end, it seems these errors have started to appear more frequently. Would be great to have tools to detect and handle this on the fly.
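Until something like that exists in the library, one way to handle it on the fly is to try parsing first and only repair when parsing fails. The sketch below is just that pattern with the parser and the repair tool passed in as callables. `parse_with_repair` and its parameters are hypothetical names, not a pdfminer.six API.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def parse_with_repair(
    path: str,
    parse: Callable[[str], T],
    repair: Callable[[str], str],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Parse path; if that raises one of errors, repair the file and retry once.

    parse  -- takes a file path and returns the parsed result
    repair -- takes a file path and returns the path of the repaired copy
    """
    try:
        return parse(path)
    except errors:
        # The original could not be parsed, so parse the repaired copy instead.
        # A second failure is not caught and propagates to the caller.
        repaired_path = repair(path)
        return parse(repaired_path)
```

With pdfminer.six, `parse` could be `pdfminer.high_level.extract_text` and `errors` could be `(PDFSyntaxError,)` imported from `pdfminer.pdfparser`, with `repair` being whichever external tool you trust. That pairing is an assumption on my part and not something tested against the files in this thread.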
@samkit-jain Thank you for the workaround - that helped me tremendously.
I was able to resolve this issue by repairing the PDF via Ghostscript.
Update: Months later I discovered some strange issues with "repaired" PDFs. For example, the word "Reflexion" was displayed just fine, but Acrobat Reader was unable to find the exact text when searching for it. Typing "Renexion" did find the word. When I selected the word in the PDF viewer and copied it into a text editor, I got "Renexion". (Just to be clear: this is not an Acrobat problem. pdfminer extracted the same "bad" word, as did the PDF readers in Firefox, Chrome and Edge.)
A similar error happened in other files. Each affected word contained an "f". If a word was affected, it was affected throughout the whole document, but not every word containing an "f" was affected. So you should probably be a bit cautious about Ghostscript's "pdfwrite" device (I used ghostscript 9.56.1 on Fedora).
mutool clean worked, though (mupdf 1.20.3 on Fedora).
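In case it helps, this is roughly how the mutool variant could be scripted from Python. It is a sketch only: `build_mutool_clean_command` and `repair_with_mutool` are names I made up, and the code assumes the plain `mutool clean input.pdf output.pdf` form with no extra options.

```python
import shutil
import subprocess
from typing import List


def build_mutool_clean_command(src: str, dst: str) -> List[str]:
    """Return the argument list for: mutool clean <src> <dst>."""
    return ["mutool", "clean", src, dst]


def repair_with_mutool(src: str, dst: str) -> bool:
    """Run mutool clean if mutool is installed (hypothetical helper).

    Returns False when no mutool binary is found on the PATH, True after a
    successful run. Raises subprocess.CalledProcessError if mutool fails.
    """
    if shutil.which("mutool") is None:
        return False
    subprocess.run(build_mutool_clean_command(src, dst), check=True)
    return True
```

Returning False instead of raising when the binary is missing makes it easy to fall back to another repair tool.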
This problem happens when you highlight words or draw in the PDF with Edge.
That's because Edge uses Adobe nowadays: https://blogs.windows.com/msedgedev/2023/02/08/adobe-acrobat-microsoft-edge-pdf/
Another way to repair the PDF in code, for example in Python with PyPDF2:
import PyPDF2

# Open the existing PDF and keep it open until the copy has been written
with open(filename, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    # Create a new PDF
    repaired_filename = f"{filename.replace('.pdf', '')}_repaired.pdf"
    with open(repaired_filename, "wb") as new_file:
        writer = PyPDF2.PdfWriter()
        # Copy every page from the old file to the new one
        for page in reader.pages:
            writer.add_page(page)
        writer.write(new_file)
Closing this because the issue can be circumvented by repairing the pdf.