pypdf
PyPDF2.errors.PdfReadError: EOF marker not found
The following script originally hung, but with PyPDF2==2.4.2 we get PdfReadError: EOF marker not found.
MCVE: PDF + Code
This file is 298MB with 21 pages.
from PyPDF2 import PdfReader
reader = PdfReader("01-006 2009-04-30 FRP NON CONFIDENTIAL PAP FILING.PDF")
Traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 267, in __init__
    self.read(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1218, in read
    raise PdfReadError("EOF marker not found")
PyPDF2.errors.PdfReadError: EOF marker not found
PyPDF2==1.27.7 gives:
Traceback (most recent call last):
  File "/home/moose/foo.py", line 3, in <module>
    reader = PdfFileReader("01-006 2009-04-30 FRP NON CONFIDENTIAL PAP FILING.PDF")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/pdf.py", line 1208, in __init__
    self.read(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/pdf.py", line 1828, in read
    line = self.readNextEndLine(stream, last1K)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/pdf.py", line 2078, in readNextEndLine
    raise PdfReadError("Could not read malformed PDF file")
PyPDF2.errors.PdfReadError: Could not read malformed PDF file
Also with strict=False
This is the part that raises the error:
# Prevent infinite loops in malformed PDFs
if stream.tell() == 0 or stream.tell() == limit_offset:
    raise PdfReadError("Could not read malformed PDF file")
The PDF is very odd: it contains about 100 MB of null characters. After removing the threshold used when looking for the end marker, it can be opened. Also, some lines in the cmap need to be stripped.
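For anyone hitting this before a fix is released, here is a minimal sketch of a possible workaround. It assumes the null characters sit after the final %%EOF marker, which the comment above does not actually state, so check your file first. The helper name `trim_after_last_eof` is hypothetical and not part of PyPDF2.

```python
EOF_MARKER = b"%%EOF"


def trim_after_last_eof(data: bytes) -> bytes:
    """Cut everything after the last %%EOF marker (hypothetical helper).

    Returns the input unchanged when no marker is present.
    """
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        return data
    # Keep the marker itself and end the file with a single newline.
    return data[: pos + len(EOF_MARKER)] + b"\n"


# Tiny stand-in for the real file: a body, the marker, then null padding.
sample = b"%PDF-1.4\n...body...\nstartxref\n0\n%%EOF\n" + b"\x00" * 1000
trimmed = trim_after_last_eof(sample)

# With PyPDF2 installed, the trimmed bytes could be passed to the reader
# as an in-memory stream:
#   import io
#   from PyPDF2 import PdfReader
#   reader = PdfReader(io.BytesIO(trimmed))
```

Note that this reads the whole file into memory, which is heavy for a 298MB PDF, and it does nothing about the cmap lines that also need stripping.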
@MartinThoma, the fix for this issue should have been released by now, shouldn't it?
Thank you for the reminder :-)