Failing to extract Metadata from Research Paper

Open codemeleon opened this issue 6 years ago • 4 comments

Hi, I am trying to extract details from publication PDFs using getDocumentInfo().

from PyPDF2 import PdfFileReader
f1 = PdfFileReader(open("./zac2343.pdf", "rb")) 
f1.getDocumentInfo() 

However, most of the fields come back empty, as follows:

{"/Author": "", 
 "/CreationDate": "D:20131213120440Z00'00'",
 "/Creator": "XPP",
 "/Keywords": "",
 "/ModDate": "D:20131213120440Z00'00'",
 "/Producer": "",
 "/Subject": "",
 "/Title": ""}

I also get a warning "PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]"

What am I doing wrong?
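The two date fields in the output above use the PDF date string layout (D:YYYYMMDDHHmmSS followed by an optional offset). As a minimal sketch for reading them, the helper below converts such a string to a datetime using only the standard library. The name parse_pdf_date is hypothetical and not part of PyPDF2, and treating a missing offset as UTC is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper, not a PyPDF2 API. Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[Zz+-])?(?P<off_h>\d{2})?'?(?P<off_m>\d{2})?'?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string such as D:20131213120440Z00'00' to a datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    g = match.groupdict()

    # Build the UTC offset; "Z" or a missing sign is treated as UTC (assumption).
    offset = timedelta(hours=int(g["off_h"] or 0), minutes=int(g["off_m"] or 0))
    if g["sign"] == "-":
        offset = -offset

    return datetime(
        int(g["year"]),
        int(g["month"] or 1),
        int(g["day"] or 1),
        int(g["hour"] or 0),
        int(g["minute"] or 0),
        int(g["second"] or 0),
        tzinfo=timezone(offset),
    )


created = parse_pdf_date("D:20131213120440Z00'00'")
print(created.isoformat())
```

This only helps interpret the fields that are present; it does not explain why /Author, /Title and the others are empty.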

codemeleon avatar Apr 02 '18 13:04 codemeleon

I am facing a similar issue. This is my code:

while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()

and this is the error I get:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Traceback (most recent call last):
  File "F:/projects/classyPDF-master/Papers_subrata/readpdf2.py", line 43, in <module>
    text += pageObj.extractText()
  File "F:\python3\lib\site-packages\PyPDF2\pdf.py", line 2595, in extractText
    content = ContentStream(content, self.pdf)
  File "F:\python3\lib\site-packages\PyPDF2\pdf.py", line 2670, in __init__
    data += s.getObject().getData()
  File "F:\python3\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 350, in decodeStreamData
    data = LZWDecode.decode(data, stream.get("/DecodeParms"))
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 255, in decode
    return LZWDecode.decoder(data).decode()
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 228, in decode
    cW = self.nextCode();
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 205, in nextCode
    nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found

Any help is appreciated. I am running on Python 3.
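The TypeError at the bottom of the traceback is consistent with a Python 2 idiom running on Python 3: indexing a bytes object yields an int in Python 3, so passing that result to ord() fails. The sketch below reproduces the behaviour with the standard library only. The helper byte_at is hypothetical and is not the actual PyPDF2 fix.

```python
def byte_at(data, pos):
    """Return the integer value at ``pos`` for either bytes or str input."""
    item = data[pos]
    # On Python 3, indexing bytes already gives an int; only str items need ord().
    return item if isinstance(item, int) else ord(item)


raw = b"\x80\x0b"  # arbitrary sample bytes, not taken from the PDF in question

try:
    ord(raw[0])  # raw[0] is an int on Python 3, so ord() rejects it
except TypeError as exc:
    print("ord() failed:", exc)

print(byte_at(raw, 0))
```

Because the failing line lives inside the library (filters.py), upgrading PyPDF2 is the practical remedy rather than patching user code.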

SubrataSarkar32 avatar Jan 18 '19 18:01 SubrataSarkar32

Do you have an example PDF that has this issue?

MartinThoma avatar Apr 07 '22 14:04 MartinThoma

You may download from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4356779/pdf/zac2343.pdf

codemeleon avatar Apr 07 '22 20:04 codemeleon

I faced this issue back in 2019. I reran the script with an updated PyPDF2 and this time got only the warning message, with no errors. I have forwarded a sample PDF file to the email address on your GitHub profile for your reference. The warning is below. PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]

SubrataSarkar32 avatar Apr 08 '22 17:04 SubrataSarkar32

@SubrataSarkar32 The (very minor) warning is displayed when using PdfFileReader instead of PdfReader (with PdfReader, the strict parameter defaults to False)

@codemeleon your file works with 2.10.5 (in progress)

@MartinThoma This issue should be closed
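For readers landing on this thread, the suggestion above can be sketched roughly as follows, assuming PyPDF2 2.x is installed. PdfReader, its strict argument and the metadata property are PyPDF2 2.x names. The helpers non_empty_fields and read_metadata are hypothetical names introduced only for this illustration.

```python
def non_empty_fields(info):
    """Keep only the metadata entries whose value is not empty."""
    return {key: value for key, value in info.items() if value not in ("", None)}


def read_metadata(path):
    """Read document metadata with PdfReader instead of PdfFileReader."""
    # Imported lazily so non_empty_fields can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path, strict=False)  # non-strict parsing
    info = reader.metadata  # may be None when the PDF has no info dictionary
    if info is None:
        return {}
    # Index through the object so indirect references are resolved.
    return non_empty_fields({key: info[key] for key in info})


# Example with the dictionary reported at the top of this issue:
reported = {
    "/Author": "",
    "/CreationDate": "D:20131213120440Z00'00'",
    "/Creator": "XPP",
    "/Keywords": "",
    "/ModDate": "D:20131213120440Z00'00'",
    "/Producer": "",
    "/Subject": "",
    "/Title": "",
}
print(non_empty_fields(reported))
```

Calling read_metadata("./zac2343.pdf") would apply the same filter to a real file; what it returns depends on the installed PyPDF2 version and on the PDF itself.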

pubpub-zz avatar Sep 03 '22 15:09 pubpub-zz

Thank you for taking care of it @pubpub-zz :pray:

PyPDF2==2.10.5 was just released :-)

MartinThoma avatar Sep 04 '22 15:09 MartinThoma