pypdf

Only last %%EOF is considered, possibly not detecting valid startxref

Open TVR1023 opened this issue 8 months ago • 5 comments

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.26100-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.4.0, crypt_provider=('cryptography', '44.0.2'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

>>> theFile = r"C:\Users\tvrom\Documents\eqbPDFChartPlus.pdf"
>>> from pypdf import PdfReader 
>>> reader = PdfReader(theFile)

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

eqbPDFChartPlus.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 136, in __init__
    self._initialize_stream(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 158, in _initialize_stream
    self.read(stream)
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 594, in read
    startxref = self._find_startxref_pos(stream)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tvrom\AppData\Local\Programs\Python\Python312\Lib\site-packages\pypdf\_reader.py", line 726, in _find_startxref_pos
    raise PdfReadError("startxref not found")
pypdf.errors.PdfReadError: startxref not found

TVR1023 avatar Apr 06 '25 15:04 TVR1023

Thanks for the report. I have reformatted your comment to increase the readability.

Regarding the error itself: The PDF file footer looks odd:

...
0000160061 00000 n 
trailer
<</ID [<96b716a3c339e63766720bcf668c73e2><96b716a3c339e63766720bcf668c73e2>]/Root 13 0 R/Encrypt 80 0 R/Size 82/Info 81 0 R>>
startxref
160113
%%EOF
tartxref
160135
%%EOF

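We could work around this by searching further back for an earlier, well-formed startxref instead of only looking before the last %%EOF. Still, depending on how you see it, this PDF file is more or less broken: the final marker reads tartxref, not startxref.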
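To illustrate the idea of looking further back, here is a minimal sketch. It is not pypdf's implementation; the function name `find_last_startxref` and the `tail_size` parameter are assumptions made up for this example. It simply scans the end of the file for the last correctly spelled `startxref` keyword followed by a number, so a trailing `tartxref` block is ignored. Whether the recovered offset actually points to a valid xref table still has to be checked separately.

```python
import re
from typing import Optional

# Matches the correctly spelled keyword followed by whitespace and a decimal offset.
_STARTXREF_RE = re.compile(rb"startxref\s+(\d+)")


def find_last_startxref(data: bytes, tail_size: int = 4096) -> Optional[int]:
    """Return the offset of the last well-formed startxref in the file tail, or None.

    Hypothetical helper for illustration only; pypdf does not expose this function.
    """
    tail = data[-tail_size:]
    matches = _STARTXREF_RE.findall(tail)
    return int(matches[-1]) if matches else None


# Footer modelled on the one quoted above (trailer dictionary shortened).
footer = (
    b"trailer\n<</Size 82>>\n"
    b"startxref\n160113\n%%EOF\n"
    b"tartxref\n160135\n%%EOF\n"
)
offset = find_last_startxref(footer)
```

For the footer above, the sketch picks up the offset from the first, correctly spelled block and skips the malformed second one. A plain regex over the tail is a heuristic: it could also match the keyword inside stream data, which is one reason a real parser would need extra validation.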

stefan6419846 avatar Apr 06 '25 15:04 stefan6419846

I have multiple other PDFs from the same source that read fine so it is possible that this one is just different. Attached is one that reads without errors.

ResultCharts_RP_10_13_2021.pdf

TVR1023 avatar Apr 06 '25 16:04 TVR1023

That file is valid: it has a single %%EOF entry preceded by a proper startxref, rather than two %%EOF entries where the second one is preceded by tartxref instead of startxref:

...
0000168799 00000 n 
trailer
<</ID [<e91f81bc68c6534c92c7bb22765e110a><e91f81bc68c6534c92c7bb22765e110a>]/Root 13 0 R/Encrypt 74 0 R/Size 76/Info 75 0 R>>
startxref
168851
%%EOF

stefan6419846 avatar Apr 07 '25 06:04 stefan6419846

Hello! I have been processing lots of badly formatted PDFs recently (mainly produced by crappy accounting software), so I have bumped into all sorts of issues like the one described here. My catch-all solution was to use Ghostscript to repair the document any time I failed to read it via the PdfReader class.

import os

def repair_pdf_structure(pdf_path: str) -> str:
    """
    Repair a PDF file using Ghostscript. Implementation for Ubuntu.
    """
    repaired_pdf_path = pdf_path.replace(".pdf", "_repaired.pdf")
    # Rewrite the whole file with the pdfwrite device, which regenerates the file structure.
    os.system(f"gs -o {repaired_pdf_path} -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress {pdf_path}")
    return repaired_pdf_path
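The "repair only when reading fails" flow described above could be sketched as follows. The helper `read_with_repair` and its parameters are hypothetical names chosen for this example; the reader and the repair step are passed in as callables so the sketch does not depend on any particular library.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def read_with_repair(
    pdf_path: str,
    open_pdf: Callable[[str], T],
    repair: Callable[[str], str],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Try to open the file; on failure, repair it once and open the repaired copy.

    Hypothetical helper for illustration. A second failure is not caught and
    propagates to the caller.
    """
    try:
        return open_pdf(pdf_path)
    except errors:
        repaired_path = repair(pdf_path)
        return open_pdf(repaired_path)
```

With pypdf, this would presumably be called as `read_with_repair(path, PdfReader, repair_pdf_structure, errors=(PdfReadError,))`, with `PdfReadError` imported from `pypdf.errors`.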

Hope this saves you some grief. Best,

Marco

mrcghil avatar Jun 28 '25 21:06 mrcghil

This depends on the use case. I am using similar approaches with Ghostscript and MuPDF, but please keep in mind that they might change much more than expected, for example forms, images etc.

And: You probably want to use subprocess.run() with a list instead of os.system() with a string to avoid security issues.
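A minimal sketch of that suggestion, reusing the same Ghostscript flags as the snippet above. The helper name `build_gs_command` is made up for this example, and it assumes `gs` is available on the PATH. Because the arguments are passed as a list, no shell is involved, so paths containing spaces or shell metacharacters are handed to Ghostscript unchanged.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(pdf_path: str, repaired_pdf_path: str) -> List[str]:
    """Build the Ghostscript argument list (hypothetical helper for illustration)."""
    return [
        "gs",
        "-o", repaired_pdf_path,
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        pdf_path,
    ]


def repair_pdf_structure(pdf_path: str) -> str:
    """Repair a PDF file using Ghostscript without going through a shell."""
    source = Path(pdf_path)
    # Derive "<name>_repaired.pdf" next to the input file.
    repaired = source.with_name(f"{source.stem}_repaired.pdf")
    # check=True raises CalledProcessError if Ghostscript exits with a non-zero status.
    subprocess.run(build_gs_command(str(source), str(repaired)), check=True)
    return str(repaired)
```

Unlike `os.system()`, this variant also surfaces a failed Ghostscript run as an exception instead of silently returning a path to a file that may not exist.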

stefan6419846 avatar Jun 29 '25 08:06 stefan6419846