Only last %%EOF is considered, possibly not detecting valid startxref
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-11-10.0.26100-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.4.0, crypt_provider=('cryptography', '44.0.2'), PIL=10.4.0
Code + PDF
This is a minimal, complete example that shows the issue:
>>> theFile = r"C:\Users\tvrom\Documents\eqbPDFChartPlus.pdf"
>>> from pypdf import PdfReader
>>> reader = PdfReader(theFile)
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 136, in __init__
    self._initialize_stream(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 158, in _initialize_stream
    self.read(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 594, in read
    startxref = self._find_startxref_pos(stream)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 726, in _find_startxref_pos
    raise PdfReadError("startxref not found")
pypdf.errors.PdfReadError: startxref not found
Thanks for the report. I have reformatted your comment to improve readability.
Regarding the error itself: The PDF file footer looks odd:
...
0000160061 00000 n
trailer
<</ID [<96b716a3c339e63766720bcf668c73e2><96b716a3c339e63766720bcf668c73e2>]/Root 13 0 R/Encrypt 80 0 R/Size 82/Info 81 0 R>>
startxref
160113
%%EOF
tartxref
160135
%%EOF
We could circumvent this by further looking for another startxref, but depending on how you see it, this PDF file is more or less broken.
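To illustrate what "looking further for another startxref" could mean, here is a minimal sketch. It is not pypdf's implementation: the helper name `find_last_valid_startxref` and the regular expression are assumptions made for this example. It scans the raw bytes for the last complete `startxref <offset> %%EOF` footer, so a mangled `tartxref` block after it is ignored. Whether the offset found this way actually points at a usable cross-reference table would still have to be checked.

```python
import re
from typing import Optional

# A complete footer: the keyword, a decimal offset, then the %%EOF marker.
# This pattern is an assumption for illustration, not taken from pypdf.
_STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_last_valid_startxref(data: bytes) -> Optional[int]:
    """Return the offset of the last well-formed startxref footer, or None."""
    matches = _STARTXREF_RE.findall(data)
    return int(matches[-1]) if matches else None


# Footer shaped like the one in the report: a valid footer followed by a
# second one whose keyword lost its first letter.
broken_footer = (
    b"trailer\n<</Size 82>>\n"
    b"startxref\n160113\n%%EOF\n"
    b"tartxref\n160135\n%%EOF\n"
)
print(find_last_valid_startxref(broken_footer))
```

Taking the last match rather than the first keeps the usual behaviour for incrementally updated files, where the final footer is the authoritative one.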
I have multiple other PDFs from the same source that read fine so it is possible that this one is just different. Attached is one that reads without errors.
That file is valid: it has a single %%EOF entry and no second footer that uses tartxref instead of startxref:
...
0000168799 00000 n
trailer
<</ID [<e91f81bc68c6534c92c7bb22765e110a><e91f81bc68c6534c92c7bb22765e110a>]/Root 13 0 R/Encrypt 74 0 R/Size 76/Info 75 0 R>>
startxref
168851
%%EOF
Hello! I have been processing lots of badly formatted PDFs recently (mainly produced by crappy accounting software), so I have bumped into all sorts of issues like the one described here. My catch-all solution was to use Ghostscript to repair the document any time I failed to read it via the PdfReader class.
import os

def repair_pdf_structure(pdf_path: str) -> str:
    """
    Repair pdf file using ghostscript. Implementation for Ubuntu.
    """
    # Write the repaired copy next to the original file.
    repaired_pdf_path = pdf_path.replace(".pdf", "_repaired.pdf")
    # Let Ghostscript re-write the whole document with the pdfwrite device.
    os.system(f"gs -o {repaired_pdf_path} -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress {pdf_path}")
    return repaired_pdf_path
Hope this saves you some grief. Best,
Marco
This depends on the use case. I am using similar approaches with Ghostscript and MuPDF, but please keep in mind that they may change far more than expected, for example forms and images.
Also, you probably want to use subprocess.run() with a list of arguments instead of os.system() with a string, to avoid security issues such as shell injection through file names.
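A minimal sketch of that suggestion, assuming the same Ghostscript flags as in the snippet above and a `gs` binary on the PATH. The helper `build_gs_command` is a name introduced here for illustration only. Because the arguments are passed as a list, no shell is involved and file names containing spaces or shell metacharacters are handed to Ghostscript unchanged.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path: Path, repaired_path: Path) -> list[str]:
    """Build the Ghostscript argument list (hypothetical helper)."""
    return [
        "gs",
        "-o", str(repaired_path),
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        str(pdf_path),
    ]


def repair_pdf_structure(pdf_path: str) -> str:
    """Re-write a PDF with Ghostscript and return the path of the new file."""
    source = Path(pdf_path)
    # Only the final suffix is replaced, unlike str.replace(".pdf", ...).
    repaired = source.with_name(f"{source.stem}_repaired.pdf")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(source, repaired), check=True)
    return str(repaired)
```

Calling `repair_pdf_structure("scan.pdf")` would then produce `scan_repaired.pdf` in the same directory, or raise if Ghostscript fails.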