PyPDF4
extractText() in PyPDF4 not working while working in PyPDF2
I have the following file: zen_of_python_corrupted.pdf. Judging by the PDF's internal code, the text content is somehow corrupted, compressed, or encoded differently. However, it displays fine when opened with a PDF viewer.
Now I want to extract the text in Python. With PyPDF2 it looks like this:
import PyPDF2
reader = PyPDF2.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.getNumPages()):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)
And indeed, it prints the Zen of Python.
With PyPDF4 it is:
import PyPDF4
reader = PyPDF4.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.numPages):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)
But there I only get:
Error -3 while decompressing data: incorrect data check
Since I can't find this particular error message anywhere in the PyPDF4 source, I assume the error comes from a third-party library. Still, I find it odd that the same file works with the older PyPDF2. Do you have any idea what causes this? Does it work on your systems if you try it out?
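For what it's worth, the message has the same format as the `zlib.error` raised by Python's standard `zlib` module when a compressed stream's trailing checksum does not match. The sketch below reproduces it without any PDF library. It assumes (I have not verified this against the file) that the PDF's content stream is Flate-compressed with a bad checksum; the payload and the way the stream is corrupted are made up for illustration.

```python
import zlib

# Hypothetical payload standing in for a PDF content stream.
payload = b"Beautiful is better than ugly."
compressed = zlib.compress(payload)

# Flip the last byte, which is part of the 4-byte Adler-32 checksum trailer.
corrupted = compressed[:-1] + bytes([compressed[-1] ^ 0xFF])

# A strict decompress verifies the checksum and fails.
try:
    zlib.decompress(corrupted)
    message = None
except zlib.error as exc:
    message = str(exc)
print(message)

# A lenient decode: skip the 2-byte zlib header and inflate the raw deflate
# data, so the checksum trailer is never checked.
lenient = zlib.decompressobj(-zlib.MAX_WBITS)
recovered = lenient.decompress(corrupted[2:])
print(recovered)
```

If that assumption holds, it could explain why a lenient reader (or a PDF viewer) still shows the text while a strict decompression step fails.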