Microsoft Word table of contents Link annotation error.

Open vokson opened this issue 6 months ago • 0 comments

I am trying to use PdfReader and PdfWriter to read and write annotations in a PDF file. The PDF is produced by Microsoft Word (Save As PDF). The Word file has 3 simple pages with the headings Page 1, Page 2, Page 3 and an automatic table of contents built from these headings. The links in the table of contents become Link annotations in the PDF file. One such annotation looks like this:

{'/Subtype': '/Link', '/Rect': [82.8, 711.57, 554.55, 731.07], '/BS': {'/W': 0}, '/F': 4, '/Dest': [IndirectObject(3, 0, 1202232362752), '/XYZ', 82, 785, 0], '/StructParent': 3}

The problem is that the value of the '/Dest' key is a list, but your code in _writer.py always expects a dictionary. The program then tries to read tmp["target_page_index"] from the list and crashes with a TypeError.

Please help. The relevant code in _writer.py is:

        if to_add.get("/Subtype") == "/Link" and "/Dest" in to_add:
            tmp = cast(Dict[Any, Any], to_add[NameObject("/Dest")])
            dest = Destination(
                NameObject("/LinkName"),
                tmp["target_page_index"],
                Fit(
                    fit_type=tmp["fit"], fit_args=dict(tmp)["fit_args"]
                ),  # I have no clue why this dict-hack is necessary
            )
            to_add[NameObject("/Dest")] = dest.dest_array
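
The failure mode can be reproduced without pypdf. The sketch below uses plain Python lists and dicts as stand-ins for the PDF objects. It shows that typing.cast does not convert anything at runtime, so indexing the list with a string key raises the same TypeError as in the traceback. The helper describe_dest is hypothetical and is not part of pypdf. It only illustrates one way to accept both shapes of '/Dest', assuming the array form is [page, fit type, fit arguments...] as in the annotation above.

```python
from typing import Any, Dict, List, Union, cast

# Plain-Python stand-in for the annotation in this report. In the real file
# the first /Dest entry is an IndirectObject; a placeholder string is used here.
annotation = {
    "/Subtype": "/Link",
    "/Dest": ["<page ref>", "/XYZ", 82, 785, 0],
}

# typing.cast only informs the type checker; at runtime it returns its
# argument unchanged, so tmp is still a list.
tmp = cast(Dict[Any, Any], annotation["/Dest"])

error_message = ""
try:
    tmp["target_page_index"]  # string key on a list
except TypeError as exc:
    error_message = str(exc)


def describe_dest(dest: Union[List[Any], Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper (not pypdf API): normalize both shapes of /Dest."""
    if isinstance(dest, dict):
        # Dictionary shape that the quoted code expects.
        return {
            "page": dest["target_page_index"],
            "fit": dest["fit"],
            "fit_args": list(dest["fit_args"]),
        }
    # Array shape produced by Word: [page, fit type, *fit arguments].
    page, fit, *fit_args = dest
    return {"page": page, "fit": fit, "fit_args": fit_args}


print(error_message)
print(describe_dest(annotation["/Dest"]))
```

A type check of this kind before the dictionary lookup would let a '/Dest' that is already an array pass through instead of crashing. Whether that is the right fix for _writer.py is for the maintainers to decide.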

Environment

$ python -m platform
Windows-10-10.0.19043-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.2, crypt_provider=('cryptography', '37.0.4'), PIL=9.4.0

Code + PDF

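    from io import BytesIO

    from pypdf import PdfReader, PdfWriter

    # `filenames` is a list of input PDF paths, defined elsewhere.
    annotations = {}
    writer = PdfWriter()
    in_memory_file = BytesIO()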

    for filename in filenames:
        reader = PdfReader(filename, strict=False)
        for page_idx, page in enumerate(reader.pages):
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    if not annotations.get(page_idx):
                        annotations[page_idx] = []

                    annotations[page_idx].append(annot.get_object())
        del reader

    reader = PdfReader(filenames[0])
    for page_idx, page in enumerate(reader.pages):
        writer.add_page(page)

    del reader
    writer.remove_links()

    for page_idx in annotations:
        for annot in annotations[page_idx]:
            writer.add_annotation(page_number=page_idx, annotation=annot)

    writer.write(in_memory_file)

Test.docx Test.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\message_bus.py", line 537, in handle_command
    result = handler(command, self._uow, self.handle)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\entrypoints\..\logic\service_layer\command_handlers\command_service_handlers.py", line 929, in mix_pdf_files
    writer.add_annotation(page_number=page_idx, annotation=annot)
  File "C:\NOSKOV\030_DEV\web_services\skotch3\src\backend\venv\lib\site-packages\pypdf\_writer.py", line 2803, in add_annotation
    tmp["target_page_index"],
TypeError: list indices must be integers or slices, not str

vokson, Dec 15 '23 11:12