pypdf
Error while reading bookmarks/outlines "TypeError: argument of type 'NoneType' is not iterable"
"TypeError: argument of type 'NoneType' is not iterable"
I got this when I tried to read the outlines of a PDF file with PdfReader.outlines.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-10-10.0.19044-SP0
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.1
Code + PDF
Example PDF file: sample.pdf (Yes, you can use this file for tests)
from PyPDF2 import PdfReader
reader = PdfReader("sample.pdf")
print(reader.outlines)
Traceback
This is the complete Traceback I see:
C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py:1089: PdfReadWarning: Object 86 0 not defined.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\Hassan\Desktop\main.py", line 3, in <module>
outlines = reader.outlines
File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 674, in outlines
return self._get_outlines()
File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 694, in _get_outlines
if "/First" in lines:
TypeError: argument of type 'NoneType' is not iterable
Thank you for reporting the issue :heart:
PdfReader.outlines is the one you should use. The others do the same thing, but they are deprecated (see the CHANGELOG).
The PDF is not standard-compliant. You can see the warning "PdfReadWarning: Object 86 0 not defined".
Via https://demo.verapdf.org/ you can see several issues. I'm not certain, though, whether they are connected to the problem you face. I think the xref table might just be wrong. I don't know how Atril can recover the outlines from it.
Yes, the problem in this file is the xref objects.
PyPDF2 reads a PDF by first locating the xref table and parsing it. It then uses additional dictionaries within the file, in conjunction with the xref table, to locate each object at its byte offset.
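To make that lookup concrete, here is a minimal standard-library sketch. It is not PyPDF2's actual implementation. The helper names find_startxref and parse_xref_table are made up for this example, and it assumes a classic, uncompressed xref table (no cross-reference streams, no incremental updates).

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def parse_xref_table(data: bytes, offset: int) -> dict:
    """Map (idnum, generation) -> byte offset for every in-use entry."""
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("offset does not point at an xref table")
    entries = {}
    i = 1
    while i < len(lines):
        # Each subsection starts with "<first object number> <entry count>".
        header = re.fullmatch(rb"(\d+)\s+(\d+)", lines[i].strip())
        if header is None:
            break  # reached 'trailer' (or something this sketch does not handle)
        first, count = int(header.group(1)), int(header.group(2))
        for n, raw in enumerate(lines[i + 1 : i + 1 + count]):
            entry = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])", raw.strip())
            if entry is None:
                raise ValueError("malformed xref entry")
            if entry.group(3) == b"n":  # 'n' = in use, 'f' = free
                entries[(first + n, int(entry.group(2)))] = int(entry.group(1))
        i += 1 + count
    return entries
```

An object that the catalog references but that has no entry in the resulting mapping, such as 86 0 in the reported file, simply cannot be located this way.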
For the outline in this example, PyPDF2 processes the trailer, which points to the document /Root. The Root points to the Outline dictionary at object 86 (ID number) 0 (generation number). This object is missing. The Outline dictionary is supposed to point to the First and Last children (outline items) and is used as the starting point to build the outline tree. The Outline dictionary does exist within the document, just at a different location (i.e., not at 86 0 R).
Fixing such an issue is possible with some commercially available PDF software, such as Adobe Acrobat or PDF-XChange. However, from what I can tell, fixing it is currently beyond PyPDF2's "plug-n-play" capabilities. I think it could be done with some one-off code written specifically for this situation, but it is probably easiest to simply open and re-save the document in Adobe Acrobat.
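The failure mode described above can be reproduced without a PDF at all. The sketch below is a toy model, not PyPDF2 code: the objects mapping and the resolve, has_first_unguarded, and has_first_guarded helpers are hypothetical names standing in for the xref lookup and for the membership test seen in the traceback.

```python
# Toy stand-in for the parsed file: (idnum, generation) -> object.
# Object (86, 0) is deliberately absent, as in the reported PDF.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Outlines": (86, 0)},
}


def resolve(ref):
    """Return the referenced object, or None when it is not defined."""
    return objects.get(ref)


def has_first_unguarded(catalog):
    lines = resolve(catalog["/Outlines"])
    # Mirrors the failing check in the traceback: when the reference is
    # dangling, lines is None and the 'in' test raises TypeError.
    return "/First" in lines


def has_first_guarded(catalog):
    lines = resolve(catalog["/Outlines"])
    if lines is None:
        # Dangling reference: treat the document as having no outline.
        return False
    return "/First" in lines
```

With the guard, a dangling /Outlines reference yields an empty outline instead of an exception. It does not recover the outline that still exists elsewhere in the file.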
For the code base, we could consider adding some logic to the PdfReader code within the _get_outlines() method such that, if the /Catalog contains a reference to the /Outlines dictionary but the referenced object is missing from the xref table, it manually parses the document's objects, attempts to infer the Outline dictionary from the attributes defined in Table 152 of the PDF 1.7 specification, and then updates the /Catalog /Outlines pointer. That would probably be best implemented as part of a larger framework to handle misplaced and/or unreferenced objects rather than as a one-off endeavor for this particular niche bug.
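A rough idea of what such a fallback scan could look like, as a standalone sketch rather than a patch for PyPDF2: the function names are invented for this example, and the heuristic is an assumption (an object is a candidate if it declares /Type /Outlines, or if it has /First and /Last but neither /Title nor /Parent, which outline items carry). It scans raw bytes only, so it would miss objects stored in compressed object streams and could be confused by binary stream data.

```python
import re

# Matches "<idnum> <generation> obj ... endobj" in the raw file bytes.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def has_key(body: bytes, name: bytes) -> bool:
    """True if the PDF name occurs as a whole name (so /First != /FirstChar)."""
    return re.search(re.escape(name) + rb"(?![A-Za-z0-9_])", body) is not None


def find_outline_root_candidates(data: bytes) -> list:
    """Return (idnum, generation, byte_offset) for likely outline dictionaries."""
    candidates = []
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        typed = re.search(rb"/Type\s*/Outlines(?![A-Za-z0-9_])", body) is not None
        shaped = (
            has_key(body, b"/First")
            and has_key(body, b"/Last")
            and not has_key(body, b"/Title")   # outline items have a title
            and not has_key(body, b"/Parent")  # and a parent; the root has neither
        )
        if typed or shaped:
            candidates.append(
                (int(match.group(1)), int(match.group(2)), match.start())
            )
    return candidates
```

If exactly one candidate turns up, a repair step could point the catalog's /Outlines entry at it. With several candidates the choice is ambiguous, and guessing is probably worse than reporting that the outline is missing.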
Outlines that Chrome can extract:
Retested with the latest dev version (2.10.4+ / 5?, in progress). The same results as in Chrome can be observed: objects 86 and 88 can be retrieved successfully.
@MartinThoma, this issue should be closed
+1?
Thank you!