Error while reading bookmarks/outlines "TypeError: argument of type 'NoneType' is not iterable"

Open hassanseoul123 opened this issue 1 year ago • 4 comments

I got "TypeError: argument of type 'NoneType' is not iterable" when I tried to read the outlines of a PDF file with PdfReader.outlines.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.1

Code + PDF

Example PDF file: sample.pdf (Yes, you can use this file for tests)

from PyPDF2 import PdfReader
reader = PdfReader("sample.pdf")
print(reader.outlines)

Traceback

This is the complete Traceback I see:

C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py:1089: PdfReadWarning: Object 86 0 not defined.
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\Hassan\Desktop\main.py", line 3, in <module>
    outlines = reader.outlines
  File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 674, in outlines
    return self._get_outlines()
  File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 694, in _get_outlines
    if "/First" in lines:
TypeError: argument of type 'NoneType' is not iterable

hassanseoul123 avatar Jul 05 '22 03:07 hassanseoul123

Thank you for reporting the issue :heart:

PdfReader.outlines is the one you should use. The others do the same thing, but they are deprecated (see CHANGELOG)

MartinThoma avatar Jul 05 '22 05:07 MartinThoma

The PDF is not standard-compliant. You can see a warning:

PdfReadWarning: Object 86 0 not defined

Via https://demo.verapdf.org/ you can see several issues... I'm not certain, though, if they are connected to the problem you face. I think the xref table might just be wrong. I don't know how atril can recover the outlines from it.

MartinThoma avatar Jul 05 '22 08:07 MartinThoma

Yes, the problem in this file is the xref objects.

PyPDF2 reads a PDF by essentially searching for the xref table and parsing it. It then uses additional dictionaries within the file, in conjunction with the xref table, to locate the various objects at their byte locations.
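To make the xref step concrete, here is a minimal sketch of reading a classic cross-reference table from raw bytes. It is not PyPDF2's implementation; the function name parse_xref_table is made up for illustration, and the sketch assumes a plain-text xref table (it does not handle cross-reference streams or incremental updates).

```python
def parse_xref_table(data: bytes) -> dict:
    """Return {(object_number, generation): byte_offset} for in-use entries.

    Hypothetical helper for illustration; handles only a classic
    plain-text xref table, not cross-reference streams.
    """
    # The last "startxref" keyword is followed by the table's byte offset.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    offset = int(data[pos + len(b"startxref"):].split()[0])

    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")

    entries = {}
    obj_num = 0
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            # Subsection header: first object number, entry count.
            obj_num = int(parts[0])
        elif len(parts) == 3 and parts[2] in (b"n", b"f"):
            # Entry: byte offset, generation, in-use (n) or free (f).
            if parts[2] == b"n":
                entries[(obj_num, int(parts[1]))] = int(parts[0])
            obj_num += 1
        else:
            break  # reached the "trailer" keyword (or unexpected content)
    return entries
```

With a table like this, a reference such as 86 0 R is resolved by looking up (86, 0). If that key is absent, or its offset does not point at the object, the lookup fails.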

For the outline in this example, PyPDF2 reads the trailer, which points to the document /Root. The /Root points to the outline dictionary at object 86 (ID number) 0 (generation number), and this object is missing. The outline dictionary is supposed to point to the /First and /Last children (outline items) and is the starting point for building the outline tree. The outline dictionary does exist within the document, just at a different location (i.e., not at 86 0 R). Some commercially available PDF software, such as Adobe Acrobat or PDF-XChange, can fix such an issue. From what I can tell, however, fixing it is currently beyond PyPDF2's "plug-n-play" capabilities. I think it could be done with some one-off code written specifically for this situation, but it is probably easiest to simply open and re-save the document in Adobe Acrobat.
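The failure mode can be reproduced without a PDF at all. The following toy model is only a sketch: the dictionaries and helper names (objects, resolve, first_item_unsafe, first_item_safe) are made up for illustration and are not PyPDF2 code. It assumes, consistent with the traceback above, that an undefined object resolves to None.

```python
# Toy stand-in for a parsed file: (object number, generation) -> dictionary.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Outlines": (86, 0)},
    # (86, 0) is deliberately absent, like "Object 86 0 not defined".
}
trailer = {"/Root": (1, 0)}


def resolve(ref):
    """Return the referenced object, or None if it is not defined."""
    return objects.get(ref)


def get_outline_root():
    # trailer -> /Root (catalog) -> /Outlines
    catalog = resolve(trailer["/Root"])
    return resolve(catalog["/Outlines"])


def first_item_unsafe():
    lines = get_outline_root()
    if "/First" in lines:  # raises TypeError when lines is None
        return lines["/First"]
    return None


def first_item_safe():
    lines = get_outline_root()
    # Guard against the missing outline dictionary before the membership test.
    if lines is not None and "/First" in lines:
        return lines["/First"]
    return None
```

A guard like the one in first_item_safe only avoids the crash; it reports an empty outline and does not recover the misplaced outline dictionary.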

For the code base, we could consider adding some logic to the _get_outlines() method of PdfReader: if the /Catalog contains a reference to the /Outlines dictionary but the referenced object is missing from the xref table, manually parse the document's objects, attempt to infer the outline dictionary from the attributes defined in Table 152 of the PDF 1.7 specification, and then update the /Catalog /Outlines pointer. That would probably be best implemented as part of a larger framework for handling misplaced and/or unreferenced objects rather than as a one-off fix for this particular niche bug.
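As a rough sketch of what such a fallback might look like, the function below scans the raw bytes for "N G obj ... endobj" blocks and flags those that look like an outline dictionary. Everything here is an assumption for illustration: find_outline_root_candidates is a hypothetical name, the heuristics are mine, and the scan will miss objects stored in compressed object streams and can be confused by stream data that happens to contain "endobj".

```python
import re

# Matches "<num> <gen> obj ... endobj" in uncompressed PDF bytes.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def find_outline_root_candidates(data: bytes) -> list:
    """Return (object_number, generation) pairs that look like an outline root.

    Hypothetical helper: a byte-level heuristic, not a real PDF parser.
    """
    candidates = []
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        # Explicit marker: /Type /Outlines (the /Type entry is optional).
        typed = re.search(rb"/Type\s*/Outlines\b", body) is not None
        # Shape heuristic: an outline root has /First and /Last but,
        # unlike outline items, no /Title and no /Parent.
        shaped = (
            b"/First" in body
            and b"/Last" in body
            and b"/Title" not in body
            and b"/Parent" not in body
        )
        if typed or shaped:
            candidates.append((int(match.group(1)), int(match.group(2))))
    return candidates
```

If exactly one candidate is found, the reader could point /Catalog /Outlines at it; with zero or several candidates, falling back to an empty outline seems safer than guessing.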

mtd91429 avatar Jul 19 '22 19:07 mtd91429

Outlines that Chrome can extract:

image

MartinThoma avatar Jul 23 '22 06:07 MartinThoma

Retested with the latest dev version (2.10.4+ / 5?, in progress): the same results as Chrome can be observed. Objects 86 and 88 can be retrieved successfully.

@MartinThoma, this issue should be closed

pubpub-zz avatar Sep 04 '22 10:09 pubpub-zz

+1?

pubpub-zz avatar Sep 06 '22 20:09 pubpub-zz

Thank you!

MartinThoma avatar Sep 07 '22 16:09 MartinThoma