pypdf HTML links to document page broken after merge

Open sanohin opened this issue 5 years ago • 5 comments

If a PDF file contains internal links (HTML anchor tags whose href is an element id), they stop working after merging.

<a href="#target">Go to target</a>
....some content
<div id="target">Target here</div>

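from io import BytesIO

from PyPDF2 import PdfFileMerger, PdfFileReader


def merge_report(pdf):
    # pdf: bytes of an HTML file rendered to PDF, containing HTML links
    report = PdfFileReader(BytesIO(pdf))
    merger = PdfFileMerger(strict=False)
    merger.append(report)

    # Write the merged document to memory and return its bytes
    result = BytesIO()
    merger.write(result)
    result.seek(0)
    return result.read()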

Nov 21 '18 14:11 sanohin

I second this issue. It also occurs when merging via the pageObject.mergePage method.

Jun 07 '19 18:06 jackneil

I'll jump on the train - having the same issue here!

Apr 20 '20 12:04 advename

Can anybody share a PDF which shows this issue? Is it still an issue with the latest PyPDF2 version?

Jun 26 '22 12:06 MartinThoma

@MartinThoma Not sure if you're still looking for an example, but you can find one below:

I generated book.pdf using the sample doc creation of the jupyter-book project. The internal HTML links in the Contents section work fine in book.pdf but don't work in book_out.pdf. The conversion from book.pdf to book_out.pdf uses the code snippet below:

from PyPDF2 import PdfReader, PdfWriter

PDF = "./doc/_build/pdf/book.pdf"
OUT_PDF = "./doc/_build/pdf/book_out.pdf"

reader = PdfReader(PDF)
writer = PdfWriter()

for page in reader.pages:
    writer.addPage(page)

with open(OUT_PDF, "wb") as f:
    writer.write(f)

Jul 19 '22 09:07 SimplyOm

Thank you @SimplyOm :hugs:

Jul 19 '22 12:07 MartinThoma

Hi @MartinThoma, is this solved? I'm facing the same issue.

Oct 04 '22 04:10 dc-em

In progress, should come back soon

Oct 04 '22 05:10 pubpub-zz

I think the cause has been found:
a) The links use named destinations, which are not copied by add_page. I've coded the append/merge functions into PdfWriter.
b) Some types were not matching. I've added a function for that (implemented in the merge changes in PR #1371, still in progress).

Oct 11 '22 21:10 pubpub-zz
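
Editor's illustration of point a) above: a minimal sketch of why copying pages one at a time can leave links dangling. This is a toy model built on plain dictionaries, not pypdf's real data structures; the names `named_dests`, `copy_pages_only`, `copy_with_names` and `broken_links` are hypothetical and exist only for this example.

```python
# Toy model (NOT pypdf internals): a "document" holds a table of named
# destinations plus pages whose links refer to those names.
source = {
    "named_dests": {"target": 1},   # destination name -> page index
    "pages": [
        {"links": ["target"]},      # page 0 links to the name "target"
        {"links": []},              # page 1 is where "target" points
    ],
}


def copy_pages_only(doc):
    """Mimic page-by-page copying: pages are copied, the name table is not."""
    return {"named_dests": {}, "pages": [dict(p) for p in doc["pages"]]}


def copy_with_names(doc):
    """Mimic a full merge: both the pages and the name table are copied."""
    return {
        "named_dests": dict(doc["named_dests"]),
        "pages": [dict(p) for p in doc["pages"]],
    }


def broken_links(doc):
    """Return every link name that no longer resolves to a destination."""
    return [
        name
        for page in doc["pages"]
        for name in page["links"]
        if name not in doc["named_dests"]
    ]


print(broken_links(copy_pages_only(source)))
print(broken_links(copy_with_names(source)))
```

In this model the page-only copy reports the link to "target" as unresolved, while the copy that also carries the name table reports none, which matches the behaviour described in the comment above.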

On a related note, when using merge, the internal links of a PDF seem to be broken. I mean links within a research paper, for example to a reference at the end of the PDF or to a section of the paper. Any ideas on how to keep those links active when merging?

Oct 30 '22 05:10 manathan1984

@manathan1984, it is now recommended to use PdfWriter and its append() method, which should fix these issues. Can you try it and update the status of this issue?

Feb 09 '23 05:02 pubpub-zz

@pubpub-zz

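from pypdf import PdfWriter  # assumption: the original snippet omitted its import; with PyPDF2, import from PyPDF2 instead

writer = PdfWriter()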
for pdf in ["cover_page.pdf", "main_report.pdf", "back_cover.pdf"]:
    writer.append(pdf)

with open("result.pdf", "wb") as f:
    writer.write(f)

I get the error below when using PdfWriter and append():

AttributeError: 'NumberObject' object has no attribute 'indirect_reference'

Feb 09 '23 09:02 DX9807

@DX9807 Can you please provide the PDF?

Feb 09 '23 12:02 pubpub-zz

@pubpub-zz Check the files given below: back_cover.pdf, central.pdf, cover_page.pdf

When I try to merge the above PDFs using PdfWriter and its append method, I get this error:

AttributeError: 'NumberObject' object has no attribute 'indirect_reference'

But when I use the PdfMerger class and its append method, the PDFs get merged, but the internal hyperlinks do not work.

Feb 10 '23 06:02 DX9807

Hello, is it included in 3.16.0?

Sep 12 '23 08:09 rocketrefrigerator

If you have a look at the last commit referenced here (https://github.com/py-pdf/pypdf/commit/b1fa953bf7585e799c1541fd4bd5b0b8daa247bc), you will see that this fix has been included since version 3.11.1.

Sep 12 '23 08:09 stefan6419846
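
Editor's illustration: a minimal sketch for checking whether a given pypdf version is at or above 3.11.1, the release named in the comment above. It assumes plain numeric X.Y.Z version strings (pre-release suffixes such as "rc1" are not handled), and the helper names `parse_version`, `has_link_fix` and `installed_pypdf_has_fix` are hypothetical, not part of pypdf.

```python
from importlib.metadata import PackageNotFoundError, version

# First release containing the fix, according to the comment above.
FIX_VERSION = (3, 11, 1)


def parse_version(text):
    """Turn '3.16.0' into (3, 16, 0); assumes a plain numeric X.Y.Z string."""
    return tuple(int(part) for part in text.split(".")[:3])


def has_link_fix(version_text):
    """Return True if the given version is 3.11.1 or newer."""
    return parse_version(version_text) >= FIX_VERSION


def installed_pypdf_has_fix():
    """Check the installed pypdf; return None if pypdf is not installed."""
    try:
        return has_link_fix(version("pypdf"))
    except PackageNotFoundError:
        return None


print(has_link_fix("3.16.0"))
```

Tuples compare element by element, so 3.16.0 is treated as newer than 3.11.1 here, consistent with the answer given above.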
