PdfReadError: Unexpected end of stream

Open MartinThoma opened this issue 1 year ago • 0 comments

I wanted to extract text from a PDF, but page.extract_text() raised PdfReadError: Unexpected end of stream.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The pdf: pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf')
>>> reader.metadata
{'/ModDate': "D:20051220065746-05'00'", '/CreationDate': "D:20051220065728-05'00'", '/Producer': 'Creo Normalizer JTP'}
>>> for page in reader.pages: print(page)
... 
{'/Annots': IndirectObject(8, 0, 139985924129648), '/Contents': [IndirectObject(20, 0, 139985924129648), IndirectObject(21, 0, 139985924129648), IndirectObject(22, 0, 139985924129648), IndirectObject(23, 0, 139985924129648), IndirectObject(24, 0, 139985924129648), IndirectObject(25, 0, 139985924129648), IndirectObject(30, 0, 139985924129648), IndirectObject(31, 0, 139985924129648)], '/Type': '/Page', '/Parent': IndirectObject(1, 0, 139985924129648), '/Rotate': 0, '/MediaBox': [72, 72, 684, 864], '/CropBox': [72, 72, 684, 864], '/BleedBox': [72, 72, 684, 864], '/TrimBox': [72, 72, 684, 864], '/ArtBox': [0, 0, 756, 936], '/Resources': IndirectObject(12, 0, 139985924129648), '/HDAG_Tools': IndirectObject(67, 0, 139985924129648), '/CREO_Tools': IndirectObject(68, 0, 139985924129648), '/CREO_Orientation': 0, '/CREO_ScaleFactor': [1, 1]}
>>> for page in reader.pages: print(page.extract_text())
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1138, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1196, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1212, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1253, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream
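
To narrow down which page triggers the error, a minimal sketch along these lines may help. It assumes only that each page object has an extract_text() method, as in the session above. The names failing_pages and report are hypothetical helpers, not part of the PyPDF2 API, and the broad except is a diagnostic shortcut rather than a fix for the underlying inline-image parsing problem.

```python
from typing import Iterable, List, Tuple


def failing_pages(pages: Iterable) -> List[Tuple[int, str]]:
    """Return (page_index, error_description) for each page whose extract_text() raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: this is a diagnostic, not a fix
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


def report(path: str) -> None:
    """Hypothetical helper: print the zero-based index and error of each failing page."""
    # Imported lazily so failing_pages() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    for index, error in failing_pages(reader.pages):
        print(f"page {index}: {error}")
```

Calling report() with the path of the PDF from this report would list every page that raises instead of stopping at the first one.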

MartinThoma, Jul 10 '22 09:07