pypdf
Repeated merging does not transfer bookmarks
I have to merge several PDFs into combined files, and then combine those files again into one big PDF (I need both the big file and the in-between combined files). Also, I need a bookmark for each file, while retaining any preexisting bookmarks:
from PyPDF2 import PdfFileMerger
pdf1 = ["test1.pdf", "test2.pdf"]
pdf2 = ["test3.pdf", "test4.pdf"]
def pdf_cat(input, output):
    merger = PdfFileMerger()
    for pdf in input:
        merger.append(pdf, bookmark=pdf, import_bookmarks=True)
    merger.write(output)
pdf_cat(pdf1, "merge1.pdf")
pdf_cat(pdf2, "merge2.pdf")
pdf3 = ["merge1.pdf", "merge2.pdf"]
pdf_cat(pdf3, "mergeALL.pdf")
The intermediate files (merge1.pdf and merge2.pdf) contain all bookmarks, including those from the source files, e.g.:
test1
    bookmark
    bookmark
test2
    bookmark
However, the final file only contains the top-level bookmarks, and not those in parentheses:
merge1
    (test1)
        (bookmark)
        (bookmark)
    (...)
merge2
If I perform the second step using Adobe Acrobat Pro all bookmarks are recovered, so from my initial testing it appears this issue is limited to the importing of bookmarks set by PyPDF2 itself. They can also be read from the intermediary files with a PdfFileReader, but they are not transferred into the new output.
EDIT: I've dug around a bit (I'm not particularly experienced), but I think the source of this issue is in the _trim_outline method of PdfFileMerger. Commenting out line 152 in merger.py stops this behavior from happening.
Got the same issue. The workaround that I found:
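from PyPDF2 import PdfFileMerger, PdfFileReader
# pdf is the path of the file being appended
merger = PdfFileMerger()
reader = PdfFileReader(pdf)
outlines = reader.getOutlines()
merger.append(pdf)
# overwrite the merger's bookmarks with the outline read from the source
merger.bookmarks = outlines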
@DmitriySelischev I don't think your method works in the general case. I've found that when you call merger.append, it adds two elements to merger.bookmarks: an entry for the current page, and a list of child entries for the merged pages. Normally the list is empty. When it is not, it contains a bunch of dictionaries copied from the target. The only thing that I've found updated in those dictionaries is the /Page values. To a first approximation, something like the following seems to work:
from PyPDF2.generic import Destination, NumberObject
...
def append(merger, reader):
    outlines = reader.outlines
    # pages already in the merger; imported bookmarks are shifted by this much
    offset = len(merger.pages)
    merger.append(reader)
    # the last element is the list of child entries for the pages just merged
    bookmarks = merger.bookmarks[-1]
    if not bookmarks:
        for d in outlines:
            extra = (v for k, v in d.items() if k not in ('/Title', '/Page', '/Type'))
            bookmarks.append(Destination(d['/Title'], NumberObject(d['/Page'] + offset),
                                         d['/Type'], *extra))
I'm unsure as to what a recursion with multiple levels of bookmarks would look like here, but it should be a start. I'm also hoping that d returns all additional parameters in the same order that extra needs them to be in based on d['/Type'].
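For what it's worth, a rough sketch of what the multi-level recursion might look like. It uses plain dicts and lists as stand-ins for the library's objects, and it assumes that a nested list holds the children of the entry just before it and that '/Page' is already an integer page index (in a real file it may be a reference that has to be resolved first). The name shift_outline is made up for this illustration and is not part of PyPDF2.

```python
def shift_outline(outline, offset):
    """Return a copy of a nested outline with every '/Page' moved by offset.

    Assumption: entries are dict-like with an integer '/Page', and a nested
    list contains the children of the entry that precedes it.
    """
    shifted = []
    for item in outline:
        if isinstance(item, list):
            # Recurse into the children, keeping the nesting intact.
            shifted.append(shift_outline(item, offset))
        else:
            entry = dict(item)  # copy so the source outline is not modified
            entry['/Page'] = item['/Page'] + offset
            shifted.append(entry)
    return shifted
```

The same walk could build Destination objects instead of dicts, as in the snippet above, once the page values have been resolved to integers.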
All,
pypdf has been upgraded, and it is now recommended to use PdfWriter instead of PdfMerger. The append() function has also been improved. Can someone retest and update the status?
I am closing this issue as fixed, since it is old and has had no recent updates. Many other issues also dealt with merging.