extractText() in PyPDF4 not working while working in PyPDF2

Open michd89 opened this issue 5 years ago • 2 comments

I have the following file: zen_of_python_corrupted.pdf. Judging by the PDF's internal code, the text content appears to be corrupted, compressed, or encoded differently than usual. However, it displays fine when opened with a PDF viewer.

Now I want to extract the text in Python. With PyPDF2 it looks like this:

import PyPDF2
reader = PyPDF2.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.getNumPages()):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

And indeed, it prints the Zen of Python.

With PyPDF4 it is:

import PyPDF4
reader = PyPDF4.PdfFileReader('zen_of_python_corrupted.pdf')
for pagenum in range(reader.numPages):
    page = reader.getPage(pagenum)
    text = page.extractText()
    print(text)

But with that I only get: Error -3 while decompressing data: incorrect data check

I can't find this particular error message anywhere in the PyPDF4 code, so I assume the error is raised by a library that PyPDF4 depends on. Still, I find it odd that the same file works with the older PyPDF2. Do you have any idea what causes this? Does it work on your systems if you try it out?
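For reference, the message text matches the one Python's standard `zlib` module raises when a stream's trailing Adler-32 checksum does not match the decompressed data. The sketch below reproduces that situation without any PDF library. It is only an illustration under stated assumptions: the sample bytes and the helper names `strict_error` and `lenient_decompress` are made up here, and nothing in it shows how PyPDF2 or PyPDF4 actually handle such streams.

```python
import zlib

# Made-up sample payload standing in for a page's compressed content stream.
original = b"Beautiful is better than ugly.\n" * 4
good = zlib.compress(original)

# Flip the last byte of the 4-byte Adler-32 trailer; the deflate data is untouched.
corrupted = good[:-1] + bytes([good[-1] ^ 0xFF])


def strict_error(data):
    """Return the zlib error message for data, or None if it decompresses cleanly."""
    try:
        zlib.decompress(data)
    except zlib.error as exc:
        return str(exc)
    return None


def lenient_decompress(data):
    """Inflate a zlib stream without verifying its checksum.

    Assumption: the stream has a plain 2-byte zlib header (no preset
    dictionary). Negative wbits selects raw deflate, which has no trailer
    check, so the leftover checksum bytes end up in unused_data.
    """
    raw = zlib.decompressobj(-zlib.MAX_WBITS)
    return raw.decompress(data[2:])


message = strict_error(corrupted)
recovered = lenient_decompress(corrupted)
print(message)
print(recovered == original)
```

If the PDF's streams are damaged in this way, a viewer or library that skips the checksum could still read the text while a strict `zlib.decompress` call fails, which would be consistent with what I see. That is a guess, not something I have confirmed in either code base.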

michd89 · Mar 04 '19 13:03