PyPDF4
extractText() in PyPDF4 not working while working in PyPDF2
I have the following file: zen_of_python_corrupted.pdf. According to the PDF's internal code, the text content is somehow corrupted, compressed, or differently encoded. However, it displays fine when opened with a PDF viewer.
Now I want to extract the text in Python. With PyPDF2 it looks like this:
import PyPDF2

reader = PyPDF2.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.getNumPages()):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)
And indeed, it prints the Zen of Python.
With PyPDF4 it is:
import PyPDF4

reader = PyPDF4.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.numPages):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)
But there I only get:
Error -3 while decompressing data: incorrect data check
Since I can't find this particular error message anywhere in the PyPDF4 code, I assume the error comes from a third-party library. Still, I find it odd that it works with the older PyPDF2. Do you have any idea what causes this? Does it work on your systems if you try it out?
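For what it's worth, the wording "Error -3 while decompressing data: incorrect data check" is what Python's standard `zlib` module raises when the checksum at the end of a compressed stream does not match the decompressed data. The sketch below reproduces the message on a deliberately damaged stream and shows one lenient way to still inflate the data. It uses only the standard library; the sample bytes and the way the stream is damaged are my own assumptions, and it does not diagnose what is actually wrong with zen_of_python_corrupted.pdf or explain why PyPDF2 succeeds.

```python
import zlib

# Sample payload (an assumption for illustration, not taken from the PDF).
original = b"Beautiful is better than ugly.\n" * 4

# A zlib stream is: 2-byte header, raw deflate data, 4-byte Adler-32 checksum.
good = zlib.compress(original)

# Simulate damage by flipping every bit of the last checksum byte.
bad = good[:-1] + bytes([good[-1] ^ 0xFF])

# Strict decompression verifies the checksum and fails.
try:
    zlib.decompress(bad)
    strict_error = None
except zlib.error as exc:
    strict_error = str(exc)

# Lenient decompression: skip the 2-byte header and inflate the body as raw
# deflate (negative wbits), so the trailing checksum is never verified.
inflater = zlib.decompressobj(-zlib.MAX_WBITS)
recovered = inflater.decompress(bad[2:])

print(strict_error)
print(recovered == original)
```

Skipping the checksum only helps when the deflate data itself is intact. If the body is damaged too, the lenient path can fail as well or return wrong bytes.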
import PyPDF4

pdfobject = open('test.pdf','rb')
pdfReader = PyPDF4.PdfFileReader(pdfobject,strict=False)
pdfWriter = PyPDF4.PdfFileWriter()
#print(bdc)
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    TEST = pageObj.extractText()
    print(TEST)
    if str('complex') in TEST:
        ......
        print('complex')
        pdfWriter.addPage(pageObj)
areneededtoobtainafinalresulttheyarenormallyincludedinfull;thisshould enabletheinstructortodeterminewhetherastudent’sincorrectanswerisdueto amisunderstandingofprinciplesortoatechnicalerror. Inallnewpublications,onpaperoronawebsite,errorsandtypographical mistakesarevirtuallyunavoidableandwewouldbegratefultoanyinstructor whobringsinstancestoourattention. KenRiley,[email protected], MichaelHobson,[email protected], Cambridge,2006 xx
PdfStreamError Traceback (most recent call last)
<ipython-input-5-d2518890f7f6> in <module>
9 pageObj = pdfReader.getPage(pageNum)
10
---> 11 TEST = pageObj.extractText()
12 print(TEST)
13 if str('complex') in TEST:
~\Anaconda\lib\site-packages\PyPDF4\pdf.py in extractText(self)
2659 content = self["/Contents"].getObject()
2660 if not isinstance(content, ContentStream):
-> 2661 content = ContentStream(content, self.pdf)
2662 # Note: we check all strings are TextStringObjects. ByteStringObjects
2663 # are strings where the byte->string encoding was unknown, so adding
~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __init__(self, stream, pdf)
2739 else:
2740 stream = BytesIO(b_(stream.getData()))
-> 2741 self.__parseContentStream(stream)
2742
2743 def __parseContentStream(self, stream):
~\Anaconda\lib\site-packages\PyPDF4\pdf.py in __parseContentStream(self, stream)
2771 peek = stream.read(1)
2772 else:
-> 2773 operands.append(readObject(stream, None))
2774
2775 def _readInlineImage(self, stream):
~\Anaconda\lib\site-packages\PyPDF4\generic.py in readObject(stream, pdf)
75 elif idx == 5:
76 # string object
---> 77 return readStringFromStream(stream)
78 elif idx == 6:
79 # null object
~\Anaconda\lib\site-packages\PyPDF4\generic.py in readStringFromStream(stream)
332 if not tok:
333 # stream has truncated prematurely
--> 334 raise PdfStreamError("Stream has ended unexpectedly")
335 if tok == b_("("):
336 parens += 1
PdfStreamError: Stream has ended unexpectedly
extractText() works but seems to struggle handling particular things inside the text.
I have Python 3.11 and pypdf installed.
pip freeze
pypdf==4.1.0
In case others struggle with the same task, here is what worked for me with a correct PDF (the pypdf documentation covers this):
from pypdf import PdfReader

reader = PdfReader('zen_of_python_corrupted.pdf')
# reader.pages is the public page sequence, so the private
# _get_num_pages() helper is not needed.
for page in reader.pages:
    text = page.extract_text()
    print(text)
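If only some pages of a file fail (as in the PdfStreamError traceback earlier in this thread), wrapping each page in its own try/except keeps the pages that do extract. This is a minimal sketch, not part of pypdf: `extract_all` is a hypothetical helper, and `_FakePage` is a stand-in used so the example runs without a PDF file. It assumes only that each page object has an `extract_text()` method, as pypdf pages do.

```python
def extract_all(pages):
    """Extract text page by page, collecting failures instead of aborting.

    Returns (texts, failures): texts maps page index -> extracted string,
    failures maps page index -> "ExceptionName: message".
    """
    texts, failures = {}, {}
    for index, page in enumerate(pages):
        try:
            texts[index] = page.extract_text()
        except Exception as exc:
            # Broad on purpose: the exception type depends on the library
            # and on how the page is damaged.
            failures[index] = f"{type(exc).__name__}: {exc}"
    return texts, failures


class _FakePage:
    """Stand-in page (hypothetical) that either returns text or raises."""

    def __init__(self, text=None, error=None):
        self._text = text
        self._error = error

    def extract_text(self):
        if self._error is not None:
            raise self._error
        return self._text


pages = [
    _FakePage(text="Simple is better than complex."),
    _FakePage(error=ValueError("Stream has ended unexpectedly")),
]
texts, failures = extract_all(pages)
print(texts)
print(failures)
```

With pypdf installed, the same helper should work on real files via `extract_all(PdfReader('zen_of_python_corrupted.pdf').pages)`. Whether a given damaged page raises during extraction or earlier, while the page list is being read, depends on the file.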