pypdf Only last %%EOF is considered, possibly not detecting valid startxref

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.26100-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.4.0, crypt_provider=('cryptography', '44.0.2'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

>>> theFile = r"C:\Users\tvrom\Documents\eqbPDFChartPlus.pdf"
>>> from pypdf import PdfReader 
>>> reader = PdfReader(theFile)

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

eqbPDFChartPlus.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 136, in __init__
    self._initialize_stream(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 158, in _initialize_stream
    self.read(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 594, in read
    startxref = self._find_startxref_pos(stream)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 726, in _find_startxref_pos
    raise PdfReadError("startxref not found")
pypdf.errors.PdfReadError: startxref not found

Apr 06 '25 15:04 TVR1023

Thanks for the report. I have reformatted your comment to improve readability.

Regarding the error itself: The PDF file footer looks odd:

...
0000160061 00000 n 
trailer
<</ID [<96b716a3c339e63766720bcf668c73e2><96b716a3c339e63766720bcf668c73e2>]/Root 13 0 R/Encrypt 80 0 R/Size 82/Info 81 0 R>>
startxref
160113
%%EOF
tartxref
160135
%%EOF

We could work around this by searching further back for another startxref, but depending on your point of view, this PDF file is more or less broken: the final %%EOF is preceded by tartxref rather than startxref.
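To illustrate what "looking further" for another startxref could mean, here is a minimal sketch. It is not pypdf's implementation: `find_startxref_candidates` and its `tail_size` parameter are hypothetical names, and the sample footer is shortened from the one quoted above. The idea is to collect every well-formed `startxref <offset> %%EOF` group near the end of the file instead of trusting only the last `%%EOF`.

```python
import re

# Hypothetical helper, not part of pypdf: match each well-formed
# "startxref <offset> %%EOF" group in a chunk of PDF bytes.
_STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_startxref_candidates(data: bytes, tail_size: int = 4096) -> list[int]:
    """Return the xref offsets found in the file tail, last one first."""
    tail = data[-tail_size:]
    offsets = [int(match.group(1)) for match in _STARTXREF_RE.finditer(tail)]
    offsets.reverse()
    return offsets


# Shortened version of the odd footer: the second group starts with
# "tartxref", so only the first group is a valid candidate.
footer = (
    b"trailer\n<</Size 82>>\n"
    b"startxref\n160113\n%%EOF\n"
    b"tartxref\n160135\n%%EOF\n"
)
print(find_startxref_candidates(footer))  # [160113]
```

Whether the recovered offset actually points at a usable cross-reference table still has to be checked separately, so a sketch like this only narrows down where to look.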

Apr 06 '25 15:04 stefan6419846

I have multiple other PDFs from the same source that read fine, so it is possible that this one is just different. Attached is one that reads without errors.

ResultCharts_RP_10_13_2021.pdf

Apr 06 '25 16:04 TVR1023

That one is valid: it does not have two %%EOF entries, one of which is preceded by tartxref instead of startxref:

...
0000168799 00000 n 
trailer
<</ID [<e91f81bc68c6534c92c7bb22765e110a><e91f81bc68c6534c92c7bb22765e110a>]/Root 13 0 R/Encrypt 74 0 R/Size 76/Info 75 0 R>>
startxref
168851
%%EOF

Apr 07 '25 06:04 stefan6419846

Hello! I have been processing lots of badly formatted PDFs recently (mainly produced by crappy accounting software), so I have bumped into all sorts of issues like the one described here. My catch-all solution was to use Ghostscript to repair the document any time I failed to read it via the PdfReader class.

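import os


def repair_pdf_structure(pdf_path: str) -> str:
    """
    Repair a PDF file using Ghostscript. Implementation for Ubuntu.
    """
    # Write the repaired copy next to the original, e.g. a.pdf -> a_repaired.pdf.
    repaired_pdf_path = pdf_path.replace(".pdf", "_repaired.pdf")
    # Re-write the whole file through Ghostscript's pdfwrite device.
    os.system(f"gs -o {repaired_pdf_path} -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress {pdf_path}")
    return repaired_pdf_path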
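The "repair only when reading fails" flow described above could be wrapped roughly like this. This is a sketch under assumptions: `read_with_repair` and its parameters are hypothetical names, and the reader and repair step are passed in as callables so the snippet does not depend on any particular library.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def read_with_repair(
    pdf_path: str,
    open_pdf: Callable[[str], T],
    repair: Callable[[str], str],
    read_errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Try to open pdf_path; if that fails, repair the file and retry once."""
    try:
        return open_pdf(pdf_path)
    except read_errors:
        # Reading failed: open the repaired copy instead. A second failure
        # is not caught and propagates to the caller.
        return open_pdf(repair(pdf_path))
```

With pypdf, one would presumably pass `PdfReader` as `open_pdf`, `repair_pdf_structure` as `repair`, and `(PdfReadError,)` from `pypdf.errors` as `read_errors`.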

Hope this saves you some grief. Best,

Marco

Jun 28 '25 21:06 mrcghil

This depends on the use case. I am using similar approaches with Ghostscript and MuPDF, but please keep in mind that they might change much more of the file than expected, for example forms or images.

Also, you probably want to use subprocess.run() with a list of arguments instead of os.system() with a string to avoid security issues.
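A minimal sketch of that suggestion, assuming the same Ghostscript flags as in the snippet above and a `gs` binary on the PATH. `build_gs_command` is a hypothetical helper added here only to make the argument list easy to inspect.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path: str, repaired_pdf_path: str) -> list[str]:
    """Build the Ghostscript argument list; no shell is involved."""
    return [
        "gs",
        "-o", repaired_pdf_path,
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        pdf_path,
    ]


def repair_pdf_structure(pdf_path: str) -> str:
    """Repair a PDF file using Ghostscript and return the new file's path."""
    source = Path(pdf_path)
    # Only touch the file name, e.g. a.pdf -> a_repaired.pdf.
    repaired = source.with_name(f"{source.stem}_repaired.pdf")
    # check=True raises CalledProcessError if Ghostscript exits non-zero.
    subprocess.run(build_gs_command(str(source), str(repaired)), check=True)
    return str(repaired)
```

Because each path is passed as its own list element, spaces or shell metacharacters in a file name are not interpreted by a shell.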

Jun 29 '25 08:06 stefan6419846