extractText() in PyPDF4 not working while working in PyPDF2

Open michd89 opened this issue 5 years ago • 2 comments

I have the following file: zen_of_python_corrupted.pdf. Judging by the PDF's internal code, the text content is somehow corrupted, compressed, or encoded differently. However, it displays fine when opened in a PDF viewer.

Now I want to extract the text in Python. With PyPDF2 it looks like this:

import PyPDF2
reader = PyPDF2.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.getNumPages()):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

And indeed it prints the Zen of Python.

With PyPDF4 it is:

import PyPDF4
reader = PyPDF4.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.numPages):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

But with that I only get: Error -3 while decompressing data: incorrect data check

Since I can't find this particular error message anywhere in the PyPDF4 code, I assume the error originates in a third-party library. Still, I find it odd that it works with the older PyPDF2. Do you have any idea what is going on? Does it work on your systems if you try it out?
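For context, the wording "Error -3 while decompressing data: incorrect data check" is what Python's standard zlib module reports when the checksum at the end of a compressed stream does not match the decompressed data. The sketch below reproduces that message on a deliberately damaged stream and shows one lenient way to read the data anyway. It only illustrates where the message comes from; it is not what PyPDF2 or PyPDF4 do internally, and the helper name inflate_ignoring_checksum is made up for this example.

```python
import zlib


def inflate_ignoring_checksum(data: bytes) -> bytes:
    """Decompress a zlib stream without verifying its Adler-32 trailer.

    Assumption: a plain zlib wrapper (2-byte header, no preset dictionary).
    """
    # Negative wbits selects raw deflate, so the trailing checksum is not checked.
    raw = zlib.decompressobj(-zlib.MAX_WBITS)
    # Skip the 2-byte zlib header; the 4 trailer bytes end up in raw.unused_data.
    return raw.decompress(data[2:])


original = b"Beautiful is better than ugly.\n" * 4
good = zlib.compress(original)

# Damage the last byte, which is part of the 4-byte Adler-32 checksum.
bad = good[:-1] + bytes([good[-1] ^ 0xFF])

try:
    zlib.decompress(bad)
    strict_error = None
except zlib.error as exc:
    strict_error = str(exc)

print(strict_error)
recovered = inflate_ignoring_checksum(bad)
print(recovered == original)
```

Skipping the checksum means genuinely damaged data can slip through unnoticed, so this is best treated as a diagnostic aid rather than a fix.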

michd89 • Mar 04 '19 13:03

import PyPDF4

pdfobject = open('test.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfobject, strict=False)
pdfWriter = PyPDF4.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)

    TEST = pageObj.extractText()
    print(TEST)
    if str('complex') in TEST:
        # ...... (code elided in the original post)
        print('complex')
        pdfWriter.addPage(pageObj)

areneededtoobtainafinalresulttheyarenormallyincludedinfull;thisshould enabletheinstructortodeterminewhetherastudent’sincorrectanswerisdueto amisunderstandingofprinciplesortoatechnicalerror. Inallnewpublications,onpaperoronawebsite,errorsandtypographical mistakesarevirtuallyunavoidableandwewouldbegratefultoanyinstructor whobringsinstancestoourattention. KenRiley,[email protected], MichaelHobson,[email protected], Cambridge,2006 xx



PdfStreamError                            Traceback (most recent call last)

<ipython-input-5-d2518890f7f6> in <module>
      9     pageObj = pdfReader.getPage(pageNum)
     10 
---> 11     TEST = pageObj.extractText()
     12     print(TEST)
     13     if str('complex') in TEST:

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in extractText(self)
   2659         content = self["/Contents"].getObject()
   2660         if not isinstance(content, ContentStream):
-> 2661             content = ContentStream(content, self.pdf)
   2662         # Note: we check all strings are TextStringObjects.  ByteStringObjects
   2663         # are strings where the byte->string encoding was unknown, so adding

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __init__(self, stream, pdf)
   2739         else:
   2740             stream = BytesIO(b_(stream.getData()))
-> 2741         self.__parseContentStream(stream)
   2742 
   2743     def __parseContentStream(self, stream):

~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __parseContentStream(self, stream)
   2771                     peek = stream.read(1)
   2772             else:
-> 2773                 operands.append(readObject(stream, None))
   2774 
   2775     def _readInlineImage(self, stream):

~\Anaconda\lib\site-packages\PyPDF4\generic.py in readObject(stream, pdf)
     75     elif idx == 5:
     76         # string object
---> 77         return readStringFromStream(stream)
     78     elif idx == 6:
     79         # null object

~\Anaconda\lib\site-packages\PyPDF4\generic.py in readStringFromStream(stream)
    332         if not tok:
    333             # stream has truncated prematurely
--> 334             raise PdfStreamError("Stream has ended unexpectedly")
    335         if tok == b_("("):
    336             parens += 1

PdfStreamError: Stream has ended unexpectedly

extractText() works on some pages but seems to struggle with particular content in the text.
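One way to keep a single unparsable page from aborting the whole loop is to catch the error per page and carry on. The following is a minimal sketch under that assumption, not a fix for the parser itself. The function name extract_text_per_page is hypothetical, and it accepts any objects that expose an extractText() method.

```python
def extract_text_per_page(pages):
    """Return (texts, failures) for an iterable of page-like objects.

    texts holds one entry per page, with None where extraction failed.
    failures holds (page_index, exception) pairs for the pages that raised.
    """
    texts = []
    failures = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extractText())
        except Exception as exc:
            # Broad on purpose: the exact exception type depends on the library.
            texts.append(None)
            failures.append((index, exc))
    return texts, failures
```

With the reader from the snippet above, the pages could be passed as (pdfReader.getPage(i) for i in range(pdfReader.numPages)), and the failures list then shows which page numbers triggered the PdfStreamError.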

bigcats01 • Mar 12 '20 15:03

I have Python 3.11 and pypdf installed. pip freeze shows:
pypdf==4.1.0

In case others struggle with the same task, here's what worked for me with a correct PDF (see the pypdf documentation):

from pypdf import PdfReader

reader = PdfReader('zen_of_python_corrupted.pdf')
for page in reader.pages:
    text = page.extract_text()
    print(text)

Schaekermann • Apr 07 '24 15:04