pypdf
Failing to extract Metadata from Research Paper
Hi, I am trying to extract details from publication PDFs using getDocumentInfo().
from PyPDF2 import PdfFileReader
f1 = PdfFileReader(open("./zac2343.pdf", "rb"))
f1.getDocumentInfo()
However, most of the fields come back empty, as follows:
{"/Author": "",
"/CreationDate": "D:20131213120440Z00'00'",
"/Creator": "XPP",
"/Keywords": "",
"/ModDate": "D:20131213120440Z00'00'",
"/Producer": "",
"/Subject": "",
"/Title": ""}
I also get a warning "PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]"
What am I doing wrong?
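The two date fields in the output above are populated, but they use the PDF date string format rather than a Python datetime. A minimal sketch of decoding them with the standard library, assuming the value starts with `D:` followed by 14 digits and an optional `Z` or `+HH'mm'` style offset; `parse_pdf_date` is a hypothetical helper name, not a PyPDF2 function:

```python
import re
from datetime import datetime, timedelta, timezone

# "D:" + YYYYMMDDHHMMSS, then optionally "Z" or a signed HH'mm' offset.
# Anything after that (such as the trailing 00'00' seen above) is ignored.
_PDF_DATE = re.compile(r"^D:(\d{14})(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?")


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it does not match."""
    match = _PDF_DATE.match(value)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    if match.group(2) == "Z":
        return stamp.replace(tzinfo=timezone.utc)
    if match.group(3):
        offset = timedelta(hours=int(match.group(4)), minutes=int(match.group(5)))
        if match.group(3) == "-":
            offset = -offset
        return stamp.replace(tzinfo=timezone(offset))
    # No timezone information present: return a naive datetime.
    return stamp


created = parse_pdf_date("D:20131213120440Z00'00'")
```

Strings that omit the seconds or use another layout fall outside this sketch and return None.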
I am facing a similar issue. This is my code:
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
and this is the error I get:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Traceback (most recent call last):
  File "F:/projects/classyPDF-master/Papers_subrata/readpdf2.py", line 43, in <module>
    text += pageObj.extractText()
  File "F:\python3\lib\site-packages\PyPDF2\pdf.py", line 2595, in extractText
    content = ContentStream(content, self.pdf)
  File "F:\python3\lib\site-packages\PyPDF2\pdf.py", line 2670, in __init__
    data += s.getObject().getData()
  File "F:\python3\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 350, in decodeStreamData
    data = LZWDecode.decode(data, stream.get("/DecodeParms"))
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 255, in decode
    return LZWDecode.decoder(data).decode()
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 228, in decode
    cW = self.nextCode();
  File "F:\python3\lib\site-packages\PyPDF2\filters.py", line 205, in nextCode
    nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found
Any help is appreciated. I am running on Python3.
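The TypeError at the bottom of the traceback is consistent with a Python 2 idiom running on Python 3: indexing a `bytes` object yields an `int` in Python 3, so passing the result to `ord()` fails. A minimal sketch of that behaviour, using hypothetical helper names (`next_byte_py2_style`, `next_byte`); this illustrates the cause only and is not PyPDF2's actual code or fix:

```python
def next_byte_py2_style(buf, pos):
    # Python 2 idiom: buf[pos] was a 1-character str, so ord() was needed.
    # On Python 3, buf[pos] on bytes is already an int and ord() raises TypeError.
    return ord(buf[pos])


def next_byte(buf, pos):
    # Works for both cases: accept the int from bytes indexing,
    # and fall back to ord() when the item is a 1-character string.
    value = buf[pos]
    return value if isinstance(value, int) else ord(value)


data = b"\x80\x01"

try:
    next_byte_py2_style(data, 0)
    raised = False
except TypeError:
    raised = True

first = next_byte(data, 0)
```

On Python 3, `raised` ends up True and `first` is the integer value of the first byte.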
Do you have an example PDF that has this issue?
You may download from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4356779/pdf/zac2343.pdf
I faced this issue back in 2019. I reran the script with an updated PyPDF2 and this time got only the warning message, with no errors. I have forwarded a sample PDF file to the email address on your GitHub profile for your reference.
The warning is below.
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]
@SubrataSarkar32 The (very minor) warning is displayed when using PdfFileReader instead of PdfReader (with PdfReader, the strict parameter defaults to False)
@codemeleon your file works with 2.10.5 (in progress)
@MartinThoma This issue should be closed
Thank you for taking care of it @pubpub-zz :pray:
PyPDF2==2.10.5
was just released :-)
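A minimal sketch of the suggestion above, reading the document information through PdfReader with strict left at False. It assumes PyPDF2 2.x is installed; `read_metadata` and `non_empty_fields` are hypothetical helper names, not part of the library, and the sample dictionary is the output reported at the top of this thread:

```python
def non_empty_fields(info):
    """Keep only metadata entries whose value is not blank."""
    return {key: value for key, value in info.items() if str(value).strip()}


def read_metadata(path):
    """Return the PDF's document information as a plain dict of strings."""
    # Imported lazily so non_empty_fields can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as handle:
        reader = PdfReader(handle, strict=False)
        info = reader.metadata or {}
        # Index by key so indirect objects are resolved before the file closes.
        return {str(key): str(info[key]) for key in info}


# Output reported in the first post, used here as sample input.
reported = {
    "/Author": "",
    "/CreationDate": "D:20131213120440Z00'00'",
    "/Creator": "XPP",
    "/Keywords": "",
    "/ModDate": "D:20131213120440Z00'00'",
    "/Producer": "",
    "/Subject": "",
    "/Title": "",
}
populated = non_empty_fields(reported)
```

Usage would look like `non_empty_fields(read_metadata("./zac2343.pdf"))`. Fields that the producing tool left blank in the file stay blank whichever reader class is used.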