pypdf
PdfReadError: Unexpected end of stream
I wanted to extract text from a PDF, but page.extract_text() raised PdfReadError: Unexpected end of stream.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2
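For anyone reproducing this, the two commands above can also be collected from within Python. This is only a convenience sketch: `describe_environment` is a hypothetical helper name, not part of PyPDF2, and it assumes Python 3.8+ for `importlib.metadata`.

```python
import platform
from importlib import metadata


def describe_environment(package: str = "PyPDF2") -> dict:
    """Return the platform string and the installed version of `package`.

    Hypothetical helper for bug reports; the version is None when the
    package is not installed in the current environment.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "platform": platform.platform(),
        "package": package,
        "version": version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```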
Code + PDF
The PDF: pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf
>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5c7a7f24459bcb9700d650062e0ab8bb.pdf')
>>> reader.metadata
{'/ModDate': "D:20051220065746-05'00'", '/CreationDate': "D:20051220065728-05'00'", '/Producer': 'Creo Normalizer JTP'}
>>> for page in reader.pages: print(page)
...
{'/Annots': IndirectObject(8, 0, 139985924129648), '/Contents': [IndirectObject(20, 0, 139985924129648), IndirectObject(21, 0, 139985924129648), IndirectObject(22, 0, 139985924129648), IndirectObject(23, 0, 139985924129648), IndirectObject(24, 0, 139985924129648), IndirectObject(25, 0, 139985924129648), IndirectObject(30, 0, 139985924129648), IndirectObject(31, 0, 139985924129648)], '/Type': '/Page', '/Parent': IndirectObject(1, 0, 139985924129648), '/Rotate': 0, '/MediaBox': [72, 72, 684, 864], '/CropBox': [72, 72, 684, 864], '/BleedBox': [72, 72, 684, 864], '/TrimBox': [72, 72, 684, 864], '/ArtBox': [0, 0, 756, 936], '/Resources': IndirectObject(12, 0, 139985924129648), '/HDAG_Tools': IndirectObject(67, 0, 139985924129648), '/CREO_Tools': IndirectObject(68, 0, 139985924129648), '/CREO_Orientation': 0, '/CREO_ScaleFactor': [1, 1]}
>>> for page in reader.pages: print(page.extract_text())
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1316, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1138, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1196, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1212, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1253, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream
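
Until the parser itself is fixed, a possible stopgap is to catch the error per page so that one failing page does not abort the whole loop. The sketch below does not address the underlying inline-image problem. `extract_text_tolerant` and `main` are hypothetical names introduced here; only `PdfReader`, `reader.pages`, `page.extract_text()` and `PyPDF2.errors.PdfReadError` come from the session above.

```python
from typing import Iterable, List, Optional, Tuple, Type


def extract_text_tolerant(
    pages: Iterable,
    error_types: Tuple[Type[BaseException], ...],
) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Call extract_text() on each page and record failures instead of aborting.

    Returns (texts, failures): texts holds None for every page that raised
    one of error_types, and failures lists (page_index, error_message).
    """
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except error_types as exc:
            texts.append(None)
            failures.append((index, str(exc)))
    return texts, failures


def main(path: str) -> None:
    # Imported lazily so the helper above can be exercised without PyPDF2.
    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError

    reader = PdfReader(path)
    texts, failures = extract_text_tolerant(reader.pages, (PdfReadError,))
    for index, message in failures:
        print(f"page {index}: {message}")
    print(f"extracted {sum(t is not None for t in texts)} of {len(texts)} pages")


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        main(sys.argv[1])
```

Run against the file from this report, the page that triggers the traceback would be expected to show up in the failure list rather than raise.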