pypdf
Internal links fail to work after page is merged in
I'm appending one PDF to another, and internal links in the second PDF fail to work after the merge.
Environment
$ python -m platform
macOS-12.7.1-x86_64-i386-64bit-Mach-O
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.5.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.2.1
Code + PDF
If I run this code I can reproduce the output. The two input PDFs are attached below.
import pypdf

input_file = 'input.pdf'
merge_file = 'tst.pdf'
output_file = 'output.pdf'

src_reader = pypdf.PdfReader(input_file)
target = pypdf.PdfWriter(clone_from=input_file)
merge_reader = pypdf.PdfReader(merge_file)
for page in merge_reader.pages:
    current_page = pypdf.PageObject.create_blank_page(
        target,
        width=page.mediabox.width,
        height=page.mediabox.height,
    )
    current_page.merge_page(page)
    target.add_page(current_page)
target.write(output_file)
Further thoughts
If I skip create_blank_page and merge_page and instead just do target.add_page(page), I get the same problem.
To debug the issue I wrote a little link dumper and here's what it produces on the output:
--- page 1 : IndirectObject(3, 0, 4552833552)
--- page 2 : IndirectObject(18, 0, 4552833552)
--- page 3 : IndirectObject(20, 0, 4552833552)
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
--- page 4 : IndirectObject(35, 0, 4552833552)
--- page 5 : IndirectObject(37, 0, 4552833552)
--- page 6 : IndirectObject(199, 0, 4552833552)
/Dest: [IndirectObject(214, 0, 4552833552), '/XYZ', 56.69292, 786.1971, 0], page: UNFOUND!!
--- page 7 : IndirectObject(220, 0, 4552833552)
--- page 8 : IndirectObject(221, 0, 4552833552)
The page we care about is page 6. As you can see, the /Dest of the link is a page object 214, but there is no such page object.
I've dug into the code, and as far as I can see merge_page just copies all annotations from the source. That means the link /Dest objects stay the same as they were before, even though the new PDF will presumably consist of all new pages. So I don't understand how this was intended to work. If I can get some guidance on what the thinking is, I can try to work on this myself.
Thanks for the report.
So I don't understand how this was intended to work.
I can just guess, but it seems like nobody stumbled upon this and cared about this for now. The goal should probably be to merge these properly as well as far as possible.
If I can get some guidance on what the thinking is I can try to work on this myself.
I appreciate any PRs which improve the current situation.
I am fixing this in our software, outside of PyPDF now first, but I will try to make a PyPDF PR afterwards.
Here is my current understanding, which will guide the implementation of the PR. Feedback welcome.
PageObject.merge_page: This operation just copies the links without making any changes to them.
PdfWriter.add_page: The added page is cloned, causing objects (links and pages) to change identity.
I tried reading the source code of PdfWriter.add_page and as far as I can tell nothing is done about links there. Instead, it seems the page structures are just cloned and added to the new PDF without any restructuring. It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.
The full complexity here is substantial, because both of the operations above are relevant, so the full implementation needs to cover a number of different cases, and it needs to collect information first, then act on it later. Let's picture a case, just to get a sense of what's going on.
We are writing to a new PDF, called T, pages T1, T2, ... We are merging PDFs A and B into T, and new pages are created by:
- create new empty page X1,
- merge into it page A1,
- merge into it page B1,
- then add it to T as T1 (the page gets cloned, so X1 becomes T1).
Now, let's say that Ax has a link to Ay, which is further out in A. There are three relevant operations for us:
- Ax is merged into Xx. The link refers to Ay, which at this point has not been added to the document (and may in theory never be added), so there is no way to know where the link should point in the new document.
- Xx is added to T as Tx. Ay still hasn't been added, still no way to resolve the link.
- T is written to file. At this point, we can tell whether Ay was ever added, and fix the links before saving.
If we are going to implement this we will need to:
- In each PageObject, track the identity of the pages that were merged into this page. (This is how we know that Ty had Ay merged into it, so we can tell that the link refers there. If Ay gets merged multiple times, the link will refer to one of the copies. If Ay never gets merged, the link will be broken, as it must be.)
- In PdfWriter.add_page, record the links that will need to be resolved at saving time.
- In PdfWriter.write, walk through the links that need to be resolved, and use the tracked page identities to work out if the page the link refers to has been added to the document. If it has, make the link point to the new page. (For direct references we simply patch the ArrayObject that holds the reference as the first element (before /XYZ), while for named references we call PdfWriter.add_named_destination to make the name point to the desired page.)
I think this will work and will cover all the corner cases, although it will probably be relatively complex. Again, feedback on this plan very much welcome, so I don't spend time implementing something that will not work, or will be rejected.
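To make the plan concrete, here is a toy sketch of the tracking-then-resolving idea. None of these class or attribute names exist in pypdf; they only illustrate the data flow (track merged-in page identities, queue links at add_page time, resolve them at write time):

```python
class Page:
    """Stand-in for a PageObject that remembers which source pages
    were merged into it."""
    def __init__(self):
        self.merged_from = []          # identities of merged-in source pages

    def merge_page(self, source):
        self.merged_from.append(id(source))


class Writer:
    """Stand-in for a PdfWriter that defers link resolution to write()."""
    def __init__(self):
        self.pages = []
        self.pending = []              # (link name, identity of target page)

    def add_page(self, page, links=()):
        self.pages.append(page)
        for name, target in links:
            self.pending.append((name, id(target)))

    def write(self):
        # A source page that was merged into some output page is
        # now "in" the document, via that output page
        absorbed = {src: page for page in self.pages for src in page.merged_from}
        # None means the target was never added, so the link stays broken
        return {name: absorbed.get(target) for name, target in self.pending}
```

A link whose target page was merged into some output page resolves to that page; one whose target never made it into the writer resolves to nothing (broken), matching the "as it must be" case above.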
I have to admit that I do not have a complete overview of all the code. It mostly is learning by doing/fixing for me as well. As long as a change improves the current situation and ideally is not too complex for review, chances are high that it gets merged. This being said, we might want to tackle this in smaller parts where possible/feasible.
It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.
As is the case most of the time, without actually looking at the specific details of the files, this is just guessing. Maybe it is just a coincidence that it keeps working out of the box, due to how the object numbers are defined in the different documents.
Thanks. I will do what I can to keep complexity down. I'm not sure it's possible to break this up much, but we could drop tracking of page identities from the first iteration. That version will not work for us, but it may be a good idea, anyway, to make overall review faster.
https://github.com/py-pdf/pypdf/blob/15c42ffac67f0626836b01f9d60a73e5416b71ad/pypdf/_page.py#L1151-L1155
Do the annotations of page2 need to be changed? We need the destination after the merge not before?
@j-t-1 I'm not sure if there is any point in changing the links there. All that's happened is that one page has been merged into another. We don't yet know what PDF this will be written to in the end. And where the links should go depends on what writer the page is eventually added to and finally written out.
So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.
I have a draft PR of that, but getting side-tracked by other bugs, so I haven't been able to finish it yet.
@larsga I reread what you wrote.
PageObject.merge_page: This operation just copies the links without making any changes to them.
So I think my comment was just saying something you had already said! Thanks for highlighting this.
So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.
Definitely seems like the way forward.
_merge_page should have a comment noting that the new annotations added to the page may not work. I will read the specification, as I am unsure whether this is an issue for Link annotations only.
After a quick look:
- In your example you are just trying to merge pages from the same document. I'd rather use merge()/append() to add all the pages at once; that way the links will not be broken. The link is broken because the pointed-to page cannot be found.
- When you use merge_page, the pointed-to page is completely rewritten and there is no way to identify that the created page is the right one. I see no option to have this fixed: you do not know what has been done in the new page.
- When you use add_page, instead of using the standard cloning process that should prevent duplication, the existing entry is deleted: https://github.com/py-pdf/pypdf/blob/15c42ffac67f0626836b01f9d60a73e5416b71ad/pypdf/_writer.py#L475-L482
This deletion places the pointed-to page/object in the "garbage collector"; this is why the link is broken.
As said in the comment in the code, we have to be careful. If we want to keep this approach (not my recommendation), I would more likely add an argument "ignore_duplicates" with a default value of False, to keep the existing behavior when calling add_page. In your original example, if I insert the pointed-to page many times, where should the link jump to?
in your example you are just trying to merge pages from the same document: I'd rather use merge()/append() to add all the pages at once, in which case the links will not be broken
The example does not reflect what the code actually does. The real workflow is:
- create empty page,
- merge page from source PDF 1 into it,
- merge page from source PDF 2 into it,
- then add to writer.
I'd love to skip the first step, but if I do, PyPDF produces broken output. Of course I'd love to figure out why and report the bug, but we only recently adopted PyPDF in production, and right now I spend all my time chasing different PyPDF bugs and fixing them or coming up with workarounds. So what gets reported to the project is just a subset of what I'm dealing with. I hope in time to be able to report everything and smooth out all the wrinkles, but for now I'm just sprinting all the time.
the link is broken because the pointed page can not be found
Sure, but the page pointed to gets added later, so it's perfectly possible to get this to work. PDFTools, a commercial tool, does it.
when you use the merge_page, the pointed-to page is completely rewritten and there is no way to identify that the created page is the right one: I see no option to have this fixed: you do not know what has been done in the new page
I have fixed it already, but haven't had time to finish the PR yet. I will submit as soon as I have time. See the long comment for an explanation of how I fix it.
When you use the add_page, instead of using the standard cloning process that should prevent duplication
I don't understand what you mean. I use add_page, yes. What are you suggesting I do instead?
this deletion places the pointed page/object in the "garbage collector" this is why the link is broken.
I don't think this is correct. The link refers to the indirect object of a page in a different PDF, so the link has to be rewritten to point to the new page in this PDF in order for the link to work.
Hi all, I just stumbled over this issue because we here at arXiv (arxiv.org) need to do concatenation of pdfs, and preserve internal links.
I know that gs (ghostscript) works with gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=out.pdf in1.pdf in2.pdf, but that can be very slow on heavy images.
The next case we have is pdftk in1.pdf in2.pdf cat output out.pdf which also works, but is a Java program which we would prefer not to use in our code base.
I also tried qpdf, which failed to preserve links.
Now for pypdf, in principle what I did is:
from pypdf import PdfWriter
merger = PdfWriter()
input1 = open("in1.pdf", "rb")
input2 = open("in2.pdf", "rb")
merger.append(input1)
merger.append(input2)
output = open("out.pdf", "wb")
merger.write(output)
merger.close()
output.close()
and links are preserved, but not all. In particular, if the link target name is the same, the second one is dropped.
An example input is the following pair of LaTeX files, a.tex and b.tex. First a.tex:
\documentclass{article}
\usepackage{hyperref}
\begin{document}
Hello World this is \ref{some-ref}.
Now let us go to the next page.
\clearpage
\section{Nice Section}\label{some-ref}
Byebye
\end{document}
and b.tex
\documentclass{article}
\usepackage{hyperref}
\begin{document}
Document B Hello World this is \ref{some-refB}.
Now let us go to the next page.
\clearpage
\section{Nice Section in B}\label{some-refB}
Byebye
\end{document}
Both have a reference to "Section 1" on the second page of the respective document.
Looking into the single pdfs (uncompressed) I see that both use section.1 as link identifier.
Looking into the pdf created by pypdf, there is still section.1 but only one, and both hyperrefs jump to the same section 1 on page 2 of the combined document.
It would be nice if this special case could also be fixed.
Thanks for your work on pypdf!
Looking into the pdf created by pypdf, there is still section.1 but only one, and both hyperrefs jump to the same section 1 on page 2 of the combined document.
While this is related, it is not exactly the issue initially reported, and it is something that could become quite tricky to solve without too much impact on, or too many changes to, the documents.
I have started work on the second part of the PR, but I'm running into a problem.
The goal is to make link rewriting work even when the source page that the link comes from is merged into another PageObject before the PageObject is written to the PdfWriter.
To do that I have to track merged-in pages, which is easy enough. The problem is that for named references I then have to figure out which of the source PDFs the link is coming from. I've made that work too, but there are two issues:
- The source PDF file may be closed. I think we can handle this by saying that in this case the link ends up being broken.
- Some of the tests become much slower. This one I have real difficulties with.
Before my change, these tests are very fast, but now they take more than 10 seconds each. The entire test_merger.py runs in about 6 seconds if I remove my link rewriting, so clearly there's a big performance hit here:
18.00s call tests/test_merger.py::test_sweep_recursion2_with_writer[https://github.com/user-attachments/files/18381700/tika-924794.pdf-tika-924794.pdf]
15.37s call tests/test_merger.py::test_sweep_recursion2_with_writer[https://github.com/user-attachments/files/18381697/tika-924546.pdf-tika-924546.pdf]
15.31s call tests/test_merger.py::test_sweep_recursion1_with_writer
12.71s call tests/test_merger.py::test_articles_with_writer
I really can't see why, though. Any ideas?
I guess that parsing the links for these larger files (tika-924794.pdf has 199 pages, tika-924546.pdf has 68 pages, tika-924666.pdf has 192 pages) already is responsible for the slowdown. To check this, you might want to start with timing how long the retrieval step for these files would take and how many links this actually involves.
If this does not help, we might have to do a timing breakdown at a suitable granularity to identify the slow spots. In the worst case, the required functionality can only run if the user sets an explicit flag, given the current side effects.
Okay, so I decided to look into one of these test cases just to see what was going on.
My full script is:
from pypdf import PdfReader, PdfWriter
name = "tika-924546.pdf"
reader = PdfReader(name)
merger = PdfWriter()
merger.append(reader)
merger.write('out.pdf')
merger.close()
reader2 = PdfReader('out.pdf')
reader2.pages
A little testing quickly shows that the append() call is what takes the time. I used cProfile and got:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 39.500 39.500 {built-in method builtins.exec}
1 0.000 0.000 39.500 39.500 <string>:1(<module>)
1 0.000 0.000 39.500 39.500 _writer.py:2704(append)
1 0.002 0.002 39.499 39.499 _writer.py:2772(merge)
192 0.000 0.000 38.633 0.201 _writer.py:570(add_page)
192 0.004 0.000 38.632 0.201 _writer.py:467(_add_page)
192 0.001 0.000 38.448 0.200 _link.py:99(extract_links)
758 0.002 0.000 38.447 0.051 _link.py:113(_build_link)
703 0.002 0.000 38.208 0.054 _link.py:131(_create_link)
691 0.002 0.000 38.207 0.055 _link.py:44(__init__)
691 0.742 0.001 38.194 0.055 _link.py:65(_find_page_in)
692 0.000 0.000 37.504 0.054 _doc_common.py:414(named_destinations)
11764/692 1.734 0.000 37.504 0.054 _doc_common.py:453(_get_named_destinations)
Note that nearly all the time is spent in _doc_common.py:453(_get_named_destinations), and apparently it's being called 692 times. So apparently for every link in the document the named destinations are being parsed out of the document all over again. _doc_common.py:935(_build_destination) is being called 690,815 times. Of course that is slow, but this seems like maybe a performance bug in PyPDF?
I suppose I could maintain some sort of cache of the named destinations for all merged-in documents but that seems a bit excessive.
Thoughts?
We clearly need to cache named_destinations in some reliable fashion here - otherwise we end up with a big performance issue.
Doing a quick check, it seems like we already determine these links on the main branch; luckily the performance issues are not this big yet and only become obvious with the additional changes.
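A minimal memoization sketch along those lines (illustrative only; a real fix inside pypdf would also have to invalidate the cache whenever the document is mutated):

```python
class CachedDestinations:
    """Wrap a reader-like object and compute named_destinations once.

    The profile above shows _get_named_destinations being re-run for
    every link; caching makes the cost per document, not per link."""

    def __init__(self, reader):
        self._reader = reader
        self._cache = None

    @property
    def named_destinations(self):
        if self._cache is None:
            # First access: parse the name tree once and keep the result
            self._cache = dict(self._reader.named_destinations)
        return self._cache
```

With something like this in the link-resolution path, the named destinations are parsed once per source document instead of once per link.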
I did a small test: I added simple caching of the named destinations, and then my test script completes in 1.1 seconds. Given that, I think the PR I have is good to go, even though it causes a substantial performance slowdown. I think the caching should be implemented separately, since it's really a separate change. I'm opening the PR now and we can discuss.