pypdf
Internal links fail to work after page is merged in
I'm appending one PDF to another, and internal links in the second PDF fail to work after the merge.
Environment
$ python -m platform
macOS-12.7.1-x86_64-i386-64bit-Mach-O
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.5.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.2.1
Code + PDF
If I run this code I can reproduce the output. The two input PDFs are attached below.
import pypdf

input_file = 'input.pdf'
merge_file = 'tst.pdf'
output_file = 'output.pdf'

src_reader = pypdf.PdfReader(input_file)
target = pypdf.PdfWriter(clone_from=input_file)
merge_reader = pypdf.PdfReader(merge_file)
for page in merge_reader.pages:
    current_page = pypdf.PageObject.create_blank_page(
        target,
        width=page.mediabox.width,
        height=page.mediabox.height,
    )
    current_page.merge_page(page)
    target.add_page(current_page)
target.write(output_file)
Further thoughts
If I skip create_blank_page and merge_page and instead just do target.add_page(page), I get the same problem.
To debug the issue I wrote a little link dumper and here's what it produces on the output:
--- page 1 : IndirectObject(3, 0, 4552833552)
--- page 2 : IndirectObject(18, 0, 4552833552)
--- page 3 : IndirectObject(20, 0, 4552833552)
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
--- page 4 : IndirectObject(35, 0, 4552833552)
--- page 5 : IndirectObject(37, 0, 4552833552)
--- page 6 : IndirectObject(199, 0, 4552833552)
/Dest: [IndirectObject(214, 0, 4552833552), '/XYZ', 56.69292, 786.1971, 0], page: UNFOUND!!
--- page 7 : IndirectObject(220, 0, 4552833552)
--- page 8 : IndirectObject(221, 0, 4552833552)
The page we care about is page 6. As you can see, the /Dest of the link is a page object 214, but there is no such page object.
I've dug into the code, and as far as I can see merge_page just copies all annotations from the source. That means the link /Dest objects stay the same as they were before, even though the new PDF will presumably consist of all new pages. So I don't understand how this was intended to work. If I can get some guidance on what the thinking is, I can try to work on this myself.
Thanks for the report.
So I don't understand how this was intended to work.
I can just guess, but it seems like nobody stumbled upon this and cared about this for now. The goal should probably be to merge these properly as well as far as possible.
If I can get some guidance on what the thinking is I can try to work on this myself.
I appreciate any PRs which improve the current situation.
I am fixing this in our software, outside of PyPDF now first, but I will try to make a PyPDF PR afterwards.
Here is my current understanding, which will guide the implementation of the PR. Feedback welcome.
PageObject.merge_page: This operation just copies the links without making any changes to them.
PdfWriter.add_page: The added page is cloned, causing objects (links and pages) to change identity.
I tried reading the source code of PdfWriter.add_page and as far as I can tell nothing is done about links there. Instead, it seems the page structures are just cloned and added to the new PDF without any restructuring. It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.
The full complexity here is substantial, because both of the operations above are relevant, so the full implementation needs to cover a number of different cases, and it needs to collect information first, then act on it later. Let's picture a case, just to get a sense of what's going on.
We are writing to a new PDF, called T, pages T1, T2, ... We are merging PDFs A and B into T, and new pages are created by:
- create new empty page X1,
- merge into it page A1,
- merge into it page B1,
- then add it to T as T1 (the page gets cloned, so X1 becomes T1).
Now, let's say that Ax has a link to Ay, which is further out in A. There are three relevant operations for us:
- Ax is merged into Xx. The link refers to Ay, which at this point has not been added to the document (and may in theory never be added), so there is no way to know where the link should point in the new document.
- Xx is added to T as Tx. Ay still hasn't been added, still no way to resolve the link.
- T is written to file. At this point, we can tell whether Ay was ever added, and fix the links before saving.
If we are going to implement this we will need to:
- In each PageObject, track the identity of the pages that were merged into this page. (This is how we know that Ty had Ay merged into it, so we can tell that the link refers there. If Ay gets merged multiple times, the link will refer to one of the copies. If Ay never gets merged, the link will be broken, as it must be.)
- In PdfWriter.add_page, record the links that will need to be resolved at saving time.
- In PdfWriter.write, walk through the links that need to be resolved, and use the tracked page identities to work out if the page the link refers to has been added to the document. If it has, make the link point to the new page. (For direct references we simply patch the ArrayObject that holds the reference as the first element (before /XYZ), while for named references we call PdfWriter.add_named_destination to make the name point to the desired page.)
I think this will work and will cover all the corner cases, although it will probably be relatively complex. Again, feedback on this plan very much welcome, so I don't spend time implementing something that will not work, or will be rejected.
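To make the plan concrete, here is a toy sketch of the tracking-then-resolving idea. None of these class or attribute names exist in pypdf; they only illustrate the data flow (track merged-in page identities, queue links at add_page time, resolve them at write time):

```python
class Page:
    """Stand-in for a PageObject that remembers which source pages
    were merged into it."""
    def __init__(self):
        self.merged_from = []          # identities of merged-in source pages

    def merge_page(self, source):
        self.merged_from.append(id(source))


class Writer:
    """Stand-in for a PdfWriter that defers link resolution to write()."""
    def __init__(self):
        self.pages = []
        self.pending = []              # (link name, identity of target page)

    def add_page(self, page, links=()):
        self.pages.append(page)
        for name, target in links:
            self.pending.append((name, id(target)))

    def write(self):
        # A source page that was merged into some output page is
        # now "in" the document, via that output page
        absorbed = {src: page for page in self.pages for src in page.merged_from}
        # None means the target was never added, so the link stays broken
        return {name: absorbed.get(target) for name, target in self.pending}
```

A link whose target page was merged into some output page resolves to that page; one whose target never made it into the writer resolves to nothing (broken), matching the "as it must be" case above.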
I have to admit that I do not have a complete overview of all the code. It mostly is learning by doing/fixing for me as well. As long as a change improves the current situation and ideally is not too complex for review, chances are high that it gets merged. This being said, we might want to tackle this in smaller parts where possible/feasible.
It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.
As is the case most of the time, without actually looking at the specific details of the files, this is just guessing. Maybe it is just a coincidence that it keeps working out of the box, due to how the object numbers are defined in the different documents.
Thanks. I will do what I can to keep complexity down. I'm not sure it's possible to break this up much, but we could drop tracking of page identities from the first iteration. That version will not work for us, but it may be a good idea, anyway, to make overall review faster.
https://github.com/py-pdf/pypdf/blob/15c42ffac67f0626836b01f9d60a73e5416b71ad/pypdf/_page.py#L1151-L1155
Do the annotations of page2 need to be changed? We need the destination after the merge not before?
@j-t-1 I'm not sure if there is any point in changing the links there. All that's happened is that one page has been merged into another. We don't yet know what PDF this will be written to in the end. And where the links should go depends on what writer the page is eventually added to and finally written out.
So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.
I have a draft PR of that, but getting side-tracked by other bugs, so I haven't been able to finish it yet.
@larsga I reread what you wrote.
PageObject.merge_page: This operation just copies the links without making any changes to them.
So I think my comment was just saying something you had already said! Thanks for highlighting this.
So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.
Definitely seems like the way forward.
_merge_page should have a comment noting that the new annotations added to the page may not work. I will read the specification, as I am unsure whether this is an issue for Link annotations only.
After a quick look:
- In your example you are just trying to merge pages from the same document. I'd rather use merge()/append() to add all the pages at once; that way the links will not be broken. The link is broken because the pointed-to page cannot be found.
- When you use merge_page, the pointed-to page is completely rewritten and there is no way to identify that the created page is the right one. I see no option to have this fixed: you do not know what has been done in the new page.
- When you use add_page, instead of using the standard cloning process that should prevent duplication, the existing entry is deleted: https://github.com/py-pdf/pypdf/blob/15c42ffac67f0626836b01f9d60a73e5416b71ad/pypdf/_writer.py#L475-L482
This deletion places the pointed-to page/object in the "garbage collector"; this is why the link is broken.
As said in the comment in the code, we have to be careful. If we want to keep this approach (not my recommendation), I would more likely add an argument "ignore_duplicates" with a default value of False, to keep the existing behavior when calling add_page. In your original example, if I insert the pointed-to page many times, where should the link jump to?
in your example you are just trying to merge pages from the same document: I'd rather use merge()/append() to add all the pages at once, in which case the links will not be broken
The example does not reflect what the code actually does. The real workflow is:
- create empty page,
- merge page from source PDF 1 into it,
- merge page from source PDF 2 into it,
- then add to writer.
I'd love to skip the first step, but if I do, PyPDF produces broken output. Of course I'd love to figure out why and report the bug, but we only recently adopted PyPDF in production, and right now I spend all my time chasing different PyPDF bugs and fixing them or coming up with workarounds. So what gets reported to the project is just a subset of what I'm dealing with. I hope in time to be able to report everything and smooth out all the wrinkles, but for now I'm just sprinting all the time.
the link is broken because the pointed page can not be found
Sure, but the page pointed to gets added later, so it's perfectly possible to get this to work. PDFTools, a commercial tool, does it.
when you use the merge_page, the pointed-to page is completely rewritten and there is no way to identify that the created page is the right one: I see no option to have this fixed: you do not know what has been done in the new page
I have fixed it already, but haven't had time to finish the PR yet. I will submit as soon as I have time. See the long comment for an explanation of how I fix it.
When you use the add_page, instead of using the standard cloning process that should prevent duplication
I don't understand what you mean. I use add_page, yes. What are you suggesting I do instead?
this deletion places the pointed page/object in the "garbage collector" this is why the link is broken.
I don't think this is correct. The link refers to the indirect object of a page in a different PDF, so the link has to be rewritten to point to the new page in this PDF in order for the link to work.
Hi all, I just stumbled over this issue because we here at arXiv (arxiv.org) need to do concatenation of pdfs, and preserve internal links.
I know that gs (ghostscript) works with gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=out.pdf in1.pdf in2.pdf, but that can be very slow on heavy images.
The next case we have is pdftk in1.pdf in2.pdf cat output out.pdf which also works, but is a Java program which we would prefer not to use in our code base.
I also tried qpdf, which failed to preserve links.
Now for pypdf, in principle what I did is:
from pypdf import PdfWriter
merger = PdfWriter()
input1 = open("in1.pdf", "rb")
input2 = open("in2.pdf", "rb")
merger.append(input1)
merger.append(input2)
output = open("out.pdf", "wb")
merger.write(output)
merger.close()
output.close()
and links are preserved, but not all. In particular, if the link target name is the same, the second one is dropped.
An example input is the following pair of LaTeX files, a.tex and b.tex. First a.tex:
\documentclass{article}
\usepackage{hyperref}
\begin{document}
Hello World this is \ref{some-ref}.
Now let us go to the next page.
\clearpage
\section{Nice Section}\label{some-ref}
Byebye
\end{document}
and b.tex
\documentclass{article}
\usepackage{hyperref}
\begin{document}
Document B Hello World this is \ref{some-refB}.
Now let us go to the next page.
\clearpage
\section{Nice Section in B}\label{some-refB}
Byebye
\end{document}
Both have a reference to "Section 1" on the second page of the respective document.
Looking into the single pdfs (uncompressed) I see that both use section.1 as link identifier.
Looking into the pdf created by pypdf, there is still section.1 but only one, and both hyperrefs jump to the same section 1 on page 2 of the combined document.
It would be nice if this special case could also be fixed.
Thanks for your work on pypdf!
Looking into the pdf created by pypdf, there is still section.1 but only one, and both hyperrefs jump to the same section 1 on page 2 of the combined document.
While this is related, it is not exactly the issue initially reported, and it is something that could become quite tricky to solve without too much impact on, or too many changes to, the documents.
I have started work on the second part of the PR, but I'm running into a problem.
The goal is to make link rewriting work even when the source page that the link comes from is merged into another PageObject before the PageObject is written to the PdfWriter.
To do that I have to track merged-in pages, which is easy enough. The problem is that for named references I then have to figure out which of the source PDFs the link is coming from. I've made that work too, but there are two issues:
- The source PDF file may be closed. I think we can handle this by saying that in this case the link ends up being broken.
- Some of the tests become much slower. This one I have real difficulties with.
Before my change, these tests are very fast, but now they take more than 10 seconds each. The entire test_merger.py runs in about 6 seconds if I remove my link rewriting, so clearly there's a big performance hit here:
18.00s call tests/test_merger.py::test_sweep_recursion2_with_writer[https://github.com/user-attachments/files/18381700/tika-924794.pdf-tika-924794.pdf]
15.37s call tests/test_merger.py::test_sweep_recursion2_with_writer[https://github.com/user-attachments/files/18381697/tika-924546.pdf-tika-924546.pdf]
15.31s call tests/test_merger.py::test_sweep_recursion1_with_writer
12.71s call tests/test_merger.py::test_articles_with_writer
I really can't see why, though. Any ideas?
I guess that parsing the links for these larger files (tika-924794.pdf has 199 pages, tika-924546.pdf has 68 pages, tika-924666.pdf has 192 pages) already is responsible for the slowdown. To check this, you might want to start with timing how long the retrieval step for these files would take and how many links this actually involves.
If this does not help, we might have to do a timing breakdown at a suitable granularity to identify the slow spots. In the worst case, the required functionality can only run if the user sets an explicit flag, given the current side effects.
Okay, so I decided to look into one of these test cases just to see what was going on.
My full script is:
from pypdf import PdfReader, PdfWriter
name = "tika-924546.pdf"
reader = PdfReader(name)
merger = PdfWriter()
merger.append(reader)
merger.write('out.pdf')
merger.close()
reader2 = PdfReader('out.pdf')
reader2.pages
A little testing quickly shows that the append() call is what takes the time. I used cProfile and got:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 39.500 39.500 {built-in method builtins.exec}
1 0.000 0.000 39.500 39.500 <string>:1(<module>)
1 0.000 0.000 39.500 39.500 _writer.py:2704(append)
1 0.002 0.002 39.499 39.499 _writer.py:2772(merge)
192 0.000 0.000 38.633 0.201 _writer.py:570(add_page)
192 0.004 0.000 38.632 0.201 _writer.py:467(_add_page)
192 0.001 0.000 38.448 0.200 _link.py:99(extract_links)
758 0.002 0.000 38.447 0.051 _link.py:113(_build_link)
703 0.002 0.000 38.208 0.054 _link.py:131(_create_link)
691 0.002 0.000 38.207 0.055 _link.py:44(__init__)
691 0.742 0.001 38.194 0.055 _link.py:65(_find_page_in)
692 0.000 0.000 37.504 0.054 _doc_common.py:414(named_destinations)
11764/692 1.734 0.000 37.504 0.054 _doc_common.py:453(_get_named_destinations)
Note that nearly all the time is spent in _doc_common.py:453(_get_named_destinations), and apparently it's being called 692 times. So apparently for every link in the document the named destinations are being parsed out of the document all over again. _doc_common.py:935(_build_destination) is being called 690,815 times. Of course that is slow, but this seems like maybe a performance bug in PyPDF?
I suppose I could maintain some sort of cache of the named destinations for all merged-in documents but that seems a bit excessive.
Thoughts?
We clearly need to cache named_destinations in some reliable fashion here - otherwise we end up with a big performance issue.
Doing a quick check, it seems like we already determine these links on the main branch; luckily the performance issues are not this big yet and only become obvious with the additional changes.
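A minimal memoization sketch along those lines (illustrative only; a real fix inside pypdf would also have to invalidate the cache whenever the document is mutated):

```python
class CachedDestinations:
    """Wrap a reader-like object and compute named_destinations once.

    The profile above shows _get_named_destinations being re-run for
    every link; caching makes the cost per document, not per link."""

    def __init__(self, reader):
        self._reader = reader
        self._cache = None

    @property
    def named_destinations(self):
        if self._cache is None:
            # First access: parse the name tree once and keep the result
            self._cache = dict(self._reader.named_destinations)
        return self._cache
```

With something like this in the link-resolution path, the named destinations are parsed once per source document instead of once per link.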
I did a small test: I added simple caching of the named destinations, and then my test script completes in 1.1 seconds. Given that, I think the PR I have is good to go, even though it causes a substantial performance slowdown. I think the caching should be implemented separately, since it's really a separate change. I'm opening the PR now and we can discuss.