BUG: PdfMerger silently fails to add invalid outline items to merged document

Open mtd91429 opened this issue 1 year ago • 1 comments

When merging two documents via the PdfMerger class, outline items with invalid destinations are not added to the merged document.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.8.0

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader, PdfMerger

doc1 = PdfReader(r"resources\outlines-with-invalid-destinations.pdf")
doc2 = PdfReader(r"resources\pdflatex-outline.pdf")

merger = PdfMerger()
merger.append(doc1, import_outline=True)
merger.append(doc2, import_outline=True)

One can see the issue if looking at the length of the resulting outline:

>>> print(len(doc1.outline))
9
>>> print(len(doc2.outline))
9
>>> print(len(merger.outline))
12

The two PDFs used in the testing are already available in the repository. The outline items that are not transferred to the merged document are located in the "outlines-with-invalid-destinations.pdf" document.
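To see which items go missing rather than just how many, the outlines can be flattened and compared by title. The sketch below is only an illustration: it assumes the outline is a nested list in which a sub-list holds the children of the preceding item and each item exposes a `.title` attribute. The helper names `iter_outline_titles` and `missing_titles` are made up for this example, and the stand-in data replaces real PDF files so the snippet runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def iter_outline_titles(outline):
    """Yield the title of every outline item, descending into nested lists."""
    for entry in outline:
        if isinstance(entry, list):
            # Assumption: a nested list holds the children of the preceding item.
            yield from iter_outline_titles(entry)
        else:
            yield entry.title


def missing_titles(source_outline, merged_outline):
    """Return titles that occur more often in the source than in the merged outline."""
    remaining = Counter(iter_outline_titles(source_outline))
    remaining.subtract(Counter(iter_outline_titles(merged_outline)))
    return sorted(title for title, count in remaining.items() if count > 0)


# Stand-in data (hypothetical titles) so the sketch runs without any PDF files.
source = [
    SimpleNamespace(title="Intro"),
    SimpleNamespace(title="Figures"),
    [SimpleNamespace(title="Fig. 1")],
    SimpleNamespace(title="Broken"),
]
merged = [
    SimpleNamespace(title="Intro"),
    SimpleNamespace(title="Figures"),
    [SimpleNamespace(title="Fig. 1")],
]

print(missing_titles(source, merged))
```

With real files, `doc1.outline` and `merger.outline` would presumably take the place of `source` and `merged`. Note that `len()` on a nested outline only counts top-level entries, so the flattened view may be the more reliable comparison.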

mtd91429, Jul 29 '22 18:07

The current behavior is a bit more complicated. Looking at some of the malformed destinations found "out in the wild" that we use for testing, outline items that have a malformed destination but also have child outline items are retained.

For example, tika-924546.pdf has two outline items that lack a destination but serve to organize child outline items (namely, "Figures" and "Tables"), and its outlines are merged successfully:

>>> from PyPDF2 import PdfReader, PdfMerger
>>>
>>> doc = PdfReader("tika-924546.pdf")
>>> print(len(doc.outline))
27
>>> merger = PdfMerger()
>>> merger.append(doc, import_outline=True)
>>> merger.append(doc, import_outline=True)
>>> print(len(merger.outline))
54

Another (more complicated) example is tika-933322.pdf.

For this issue, the behavior stems from the _trim_outline method of the PdfMerger class. Intuitively, this removes outline items that reference a page outside of the merged pages or that have a null reference. However, for outline items with null destinations embedded within a complex hierarchy, it results in unanticipated behavior.
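A toy model may help show why the results look inconsistent. The code below is not PyPDF2's `_trim_outline`; it is a simplified sketch of the rule as described in this thread, using plain dictionaries with hypothetical `title`, `page`, and `children` keys. An item survives if its page falls inside the merged range, or if at least one of its children survives.

```python
def trim(items, first_page, last_page):
    """Toy trimming rule: keep in-range items and parents of surviving children."""
    kept = []
    for item in items:
        children = trim(item.get("children", []), first_page, last_page)
        page = item["page"]
        in_range = page is not None and first_page <= page < last_page
        if in_range or children:
            kept.append({"title": item["title"], "page": page, "children": children})
    return kept


# Hypothetical outline: one grouping item without a destination, one orphan
# without a destination, and one item pointing past the merged pages.
outline = [
    {"title": "Figures", "page": None,
     "children": [{"title": "Fig. 1", "page": 2, "children": []}]},
    {"title": "Orphan", "page": None, "children": []},
    {"title": "Appendix", "page": 40, "children": []},
]

trimmed = trim(outline, 0, 10)
print([item["title"] for item in trimmed])
```

Under this model, "Figures" is kept even though it has no destination, because its child is in range, while "Orphan" is dropped without any warning. That would match the difference between the two test files above, assuming the real method follows the same rule.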

I'm thinking that the simplest way to address this issue is to add a new kwarg that retains all outline items in the merged document, regardless of their destination. The default would be False, as I think the current trimming is what is desired in most cases (i.e., trim away leading and trailing outline items that are no longer part of the page range). If True is passed, all outline items would be added; for those pointing to pages outside of the merged document, the destination reference would be changed to Null. This way, for complicated examples like tika-933322.pdf, the end user could ensure the complicated tree architecture is retained and subsequently fix it rather than attempting to regenerate it (which is likely more work).
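As a rough sketch of how such a flag might behave (the name `retain_all` and the dictionary layout are placeholders for illustration, not an existing PyPDF2 parameter or data structure):

```python
def filter_outline(items, first_page, last_page, retain_all=False):
    """Toy model of the proposed flag.

    retain_all=False: drop items that are out of range and have no surviving children.
    retain_all=True:  keep every item, but null out destinations that are out of range.
    """
    result = []
    for item in items:
        children = filter_outline(
            item.get("children", []), first_page, last_page, retain_all
        )
        page = item["page"]
        in_range = page is not None and first_page <= page < last_page
        if retain_all:
            # Keep the tree intact; out-of-range pages become None (Null).
            new_page = page if in_range else None
            result.append({"title": item["title"], "page": new_page, "children": children})
        elif in_range or children:
            result.append({"title": item["title"], "page": page, "children": children})
    return result


# Hypothetical outline with one in-range item, one null destination,
# and one destination beyond the merged pages.
outline = [
    {"title": "Intro", "page": 0, "children": []},
    {"title": "Orphan", "page": None, "children": []},
    {"title": "Appendix", "page": 40, "children": []},
]

default = filter_outline(outline, 0, 10)
everything = filter_outline(outline, 0, 10, retain_all=True)
print([(i["title"], i["page"]) for i in default])
print([(i["title"], i["page"]) for i in everything])
```

With the flag set, the full hierarchy survives and only the destinations are cleared, so the user can repair them afterwards instead of rebuilding the tree.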

mtd91429, Jul 29 '22 20:07

@mtd91429, can you check with PdfWriter and its new append capability and report the status?

pubpub-zz, Feb 08 '23 21:02

Without feedback, I am closing this as fixed. Feel free to provide an update to reopen it if necessary.

pubpub-zz, Feb 26 '23 14:02