pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

BUG: calling `get_{variable}_root` method of PdfWriter causes `ValueError` for documents with pre-existing variable

Open mtd91429 opened this issue 1 year ago • 5 comments

I was attempting to add outline items to a document which already had an outline.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.8.1

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader, PdfWriter

# import a PDF without an outline
reader1 = PdfReader(r"sample-files\004-pdflatex-4-pages\pdflatex-4-pages.pdf")
writer1 = PdfWriter()
writer1.clone_document_from_reader(reader1)
# add an outline item to the document
writer1.add_outline_item("First Outline Item", pagenum=2)
# save the document
with open("dont_commit_first.pdf", "wb") as output_file:
    writer1.write(output_file)

# re-open the document with the outline item added from above
reader2 = PdfReader("dont_commit_first.pdf")
writer2 = PdfWriter()
writer2.clone_document_from_reader(reader2)
# attempt to add an outline item to the document
writer2.add_outline_item("Another Outline Item", pagenum=3)

The PDF file in the example is already in the sample-files sub-repository.

Of note, this error will occur if the outlines were added by PyPDF2 or another source

Traceback

This is the complete Traceback I see:

Exception has occurred: ValueError
{'/First': IndirectObject(9, 0, 1781903553824), '/Count': 2, '/Last': IndirectObject(11, 0, 1781903553824)} is not in list
  File "C:\...\PyPDF2\PyPDF2\_writer.py", line 1009, in get_outline_root
    idnum = self._objects.index(outline) + 1
  File "C:\...\PyPDF2\PyPDF2\_writer.py", line 1209, in add_outline_item
    parent = self.get_outline_root()
  File "C:\...\PyPDF2\testing.py", line 19, in <module>
    writer2.add_outline_item("Another Outline Item", pagenum=4)

mtd91429 avatar Jul 30 '22 22:07 mtd91429

The bug is bigger than just the outline items. It seems to also involve named destinations as well.

To reproduce the error, use the following code:

from PyPDF2 import PdfReader, PdfWriter

# import a PDF without a named destination
reader1 = PdfReader(r"sample-files\004-pdflatex-4-pages\pdflatex-4-pages.pdf")
writer1 = PdfWriter()
writer1.clone_document_from_reader(reader1)
# get named destination root, which creates the object the first time it is called and doesn't find it
writer1.get_named_dest_root()
# save the document
with open("named_dest_root.pdf", "wb") as output_file:
    writer1.write(output_file)

# re-open the document with the named dest from above
reader2 = PdfReader("named_dest_root.pdf")
writer2 = PdfWriter()
writer2.clone_document_from_reader(reader2)

writer2.get_named_dest_root()

Traceback

Exception has occurred: ValueError
{'/Dests': IndirectObject(13, 0, 2229695489664)} is not in list
  File "C:\Users\...\PyPDF2\PyPDF2\_writer.py", line 1034, in get_named_dest_root
    idnum = self._objects.index(names) + 1
  File "C:\Users\...\PyPDF2\mtd_color_testing.py", line 19, in <module>
    writer2.get_named_dest_root()

The code breaks on the lines which attempt to get the ID number for the desired object. For example, for the named destinations, it uses idnum = self._objects.index(names) + 1 which is currently line 1034 of _writer.py

mtd91429 avatar Jul 31 '22 18:07 mtd91429

Is this a bug which we introduced with the recent changes or is that an issue PyPDF2 had for a longer time?

(I'm asking because I want to know if I should postpone the release)

MartinThoma avatar Jul 31 '22 18:07 MartinThoma

I think it is an issue that PyPDF2 has had for a very long time. None of the outline code changes I made actually touched the get_outline_root() method because this was the only method that happened to follow the naming convention. That was my initial worry as well. Additionally, it the issue seems to involve other parts of the code (for example, Named Destinations) which were not changed by the update. I'm trying to understand the reasoning behind the idnum = self._objects.index(outline) + 1 and the subsequent assertion check in order to issue a fix.

mtd91429 avatar Jul 31 '22 18:07 mtd91429

In further digging, I think this really has to do with the method clone_document_from_reader(). The method first clones the document Root (which, contains the catalog references to Named Destinations and Outlines). Then, it iterates over all of the document pages and adds them both to the PdfWriter._objects list and the page tree. For Named Destinations and Outlines, these are not added to the PdfWriter._objects list. They are still written to the final file, however, via the _sweep_indirect_references() method which ultimately adds them to the PdfWriter._objects list before writing the file output.

The problem arises because the methods get_named_dest_root() and get_outline_root() ultimately double check they've grabbed the correct object and use the PdfWriter._objects list which, prior to writing to file, has not added the outline or named destinations to the list. I like the double checking and don't want to remove that. My first thought is to add the _sweep_indirect_references() method to the clone_document_from_reader() method. I'll see how that goes and issue a PR if it seems to work.

mtd91429 avatar Jul 31 '22 19:07 mtd91429

That seemed to work. With this fix, the _sweep_indirect_references() method gets called twice when generating a PDF by cloning it from the PdfReader: first when cloned, second when it is written to file. This is a bit redundant but at least ensure that all objects within the document are accounted for within the PdfWriter object before manipulating it later.

This does make the method append_pages_from_reader() redundant; the page tree is still swept via the _sweep_indirect_references() method and are added to the _objects list. The final file that is written doesn't contain extra pages or larger file size; I think this is due to Python's namespaces, but I'm not sure where/if redundancies are pruned. The only aspect append_pages_from_reader() adds is the callback function.

mtd91429 avatar Jul 31 '22 20:07 mtd91429

@mtd91429 Can you check with PdfWriter.append()

pubpub-zz avatar Feb 08 '23 21:02 pubpub-zz

I close this PR as fixed. Feel free to ask to reopen int

pubpub-zz avatar Feb 26 '23 14:02 pubpub-zz