pypdf BUG: Invalid Link

BUG: Invalid Link

Open rsinger417 opened this issue 4 months ago • 23 comments

passes an IndirectObject for the target page instead of an integer. passing an integer creates an invalid link.

Fixes #2443

Feb 08 '24 16:02 rsinger417

Are you able to add a corresponding test case as well which shows the previous issue and demonstrates that your fix does indeed solve this?

Feb 08 '24 16:02 stefan6419846

testpages.csv testIndexCenterPnt.csv Test Book0.pdf link test .txt

link test.txt is the code. I could not upload a *.py file. The code needs to be changed to find the supporting files as the locations are hard coded.

An invalid link is created, it works, but if you delete a page the links are broken. Acrobat will re move then as invalid links when optimized.

Feb 08 '24 16:02 rsinger417

Could you add something of this as some automated unit/integration test?

Feb 09 '24 07:02 stefan6419846

I don't know what I'm doing. I never read the book on GitHub. I don't known how to automated unit/integration test. The code I uploaded was how I tested it. The links work fine with the final pdf until until you remove pages or optimize it with Acrobat. The links are invalid because the destination page in the link are integers. The code was broke with PyPDF2 version 2.90 (7/31/2022) with the introduction of the method add_annotation in the class PdfWriter. add_link was deprecated in version 2.9.0. I got the code fix from version 2.8.1 (7/25/2022) from the add_link method in the class PdfWriter line 1532 and line 1534, this was the latest version that I found where the link worked correctly. I left line 1532 unchanged. 1532 pages_obj = cast(Dict[str, Any], self.get_object(self._pages)) 1534 page_dest = pages_obj[PA.KIDS][pagedest] # TODO: switch for external link I replaced "pagedest" with "tmp["target_page_index"]" which is the integer value of the page and "page_dest" with "taget_page" which is the IndirectObject of the page. This IndirectObject references the same page even when other pages are removed and keeping it from being an invalid link. I got rid of the TODO comment.

Feb 09 '24 17:02 rsinger417

@rsinger417 The test tests/test_generic.py::test_annotation_builder_link fails. Do you see why? (It could also be a test issue; I haven't looked into it so far)

Feb 13 '24 21:02 MartinThoma

I can't see what the code is, but by the error message it is using AnnotationBuilder.link which has been deprecated. The class Link in _markup_annotations.py should be used. I think AnnotationBuilder is completely gone in version 4.0.0

Raymond Singer Engineering Technician V T: 262.653.4154

625 52nd Street, Room 302

Kenosha, WI 53140

On Tue, Feb 13, 2024 at 3:57 PM Martin Thoma @.***> wrote:

https://github.com/rsinger417 This message originated from outside your organization

https://github.com/rsinger417 @rsinger417 https://github.com/rsinger417 The test tests/test_generic.py::test_annotation_builder_link fails. Do you see why? (It could also be a test issue; I haven't looked into it so far)

— Reply to this email directly, view it on GitHub https://github.com/py-pdf/pypdf/pull/2450#issuecomment-1942704937, or unsubscribe https://github.com/notifications/unsubscribe-auth/BF5XNWG6OMGU6BLFDUODEA3YTPOWFAVCNFSM6AAAAABDABLCT2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBSG4YDIOJTG4 . You are receiving this because you were mentioned.Message ID: @.***>

Feb 13 '24 22:02 rsinger417

I do not see the warning in the CI log, but an actual error which maps to a changed code line:

>           target_page = pages_obj[PA.KIDS][tmp["target_page_index"]]
E           IndexError: list index out of range

It seems like tmp["target_page_index"] returns an invalid page number?

Feb 15 '24 20:02 stefan6419846

what is the value in "target_page_index"? It should be a page number in the pdf.

Feb 15 '24 22:02 rsinger417

what is the value in "target_page_index"? It should be a page number in the pdf.

See the CI output which displays this value:

annotation = {'/Type': '/Annot', '/Subtype': '/Link', '/Rect': RectangleObject([100, 100, 300, 200]), '/Border': [50, 10, 4], '/Dest': {'target_page_index': 1, 'fit': '/Fit', 'fit_args': []}, '/P': IndirectObject(4, 0, 140453995250688)}

You should be able to locally debug this as well, as this is a permanent error. The background is that writer only has one page due to https://github.com/py-pdf/pypdf/blob/cc306ad6abfb232f6922a7f0e939831d6611d0b7/tests/test_generic.py#L911 Thus target_page_index=1 now points to an invalid page:

pages_obj: {'/Type': '/Pages', '/Count': 1, '/Kids': [IndirectObject(4, 0, 140453995250688)]}
pages_obj[PA.KIDS]: [IndirectObject(4, 0, 140453995250688)]
tmp: {'target_page_index': 1, 'fit': '/Fit', 'fit_args': []}

With this analysis, your proposed change is a breaking one and thus most likely requires a deprecation process - although it might be debatable whether the current implementation would constitute as a bug or desired behavior.

Feb 19 '24 12:02 stefan6419846

The test is wrong lines 954-955 should come before line 952 since there is no page 1 when link_annotation is call you will get the "IndexError: list index out of range." You have to add the page first. The first page in the pdf is page 0, the second page is page 1. If there is no second page in the pdf you will still get an error. You could put a test to check if it is out of range and return an error such as "No such page 1 in pdf"

945 # Part 4: Internal Link 946 with pytest.warns(DeprecationWarning): 947 link_annotation = AnnotationBuilder.link( 948 rect=(100, 100, 300, 200), 949 target_page_index=1, 950 border=[50, 10, 4], 951 ) 952 writer.add_annotation(0, link_annotation) 953 954 for page in reader.pages[1:]: 955 writer.add_page(page)

Feb 19 '24 18:02 rsinger417

The test is wrong

The test has been there before your change and worked, thus I would assume that this is/was some intended functionality. Apparently it was considered a valid use case to generate annotations for invalid pages which might be added later on. Yes, we could argue about how useful this is, but this is just how it has been in the past. The original change where this has been introduced is #1189.

Let's wait for the opinion of the other maintainers regarding this.

Feb 19 '24 19:02 stefan6419846

Unless you add_page first there is no page and thus no IndirectLink to it. If there is no IndirecLink then the link will be invalid. How can you link to an internal page that does not exist. PR #1189 does not have this test in it. The date of this PR #1189 commit was when the links became invalid PyPDF2 2.9.0 7/31/2022. The links were good in PyPDF2 2.8.1. The invalid links will work but they are not valid and will be broken if pages are removed and if the file is optimized in Acrobat they will be removed.

Feb 19 '24 20:02 rsinger417

@pubpub-zz @MartinThoma Any input on https://github.com/py-pdf/pypdf/pull/2450#issuecomment-1953060636 and how to continue with the (now failing) test regarding possibly breaking behavior?

Feb 25 '24 18:02 stefan6419846

destinations within the document are using indirect objects: but for remote destination numbers are accepted: (...)

Having a look deeper, I think the issue is coming from the error in the typing in the Link constructor which is not dealing with link to external documents.

Feb 25 '24 21:02 pubpub-zz

@pubpub-zz It seems like your misread my comment. The remaining issue is that until now, pypdf would allow links to pages not (yet) added to the file; while with the solution proposed in this PR, we would restrict this to pages already present in the file. I consider this a possibly breaking change, but wanted to get a second opinion on this before continuing.

Feb 26 '24 07:02 stefan6419846

@pubpub-zz It seems like your misread my comment. The remaining issue is that until now, pypdf would allow links to pages not (yet) added to the file; while with the solution proposed in this PR, we would restrict this to pages already present in the file. I consider this a possibly breaking change, but wanted to get a second opinion on this before continuing.

Having to add the pages within the PdfWriter before referencing in links is mandatory : you can not guess what will be the IndirectObject before adding it into the PdfWriter.

Feb 26 '24 20:02 pubpub-zz

In the current release, this would work indeed, id est referencing an invalid page. After merging this PR, this would change, thus it might be a breaking one - although I am not sure whether we consider the old behavior (allowing invalid references) a bug or not, as for a bug we would not have to really consider this a breaking change.

Feb 27 '24 09:02 stefan6419846

The legacy code was compatible with both arguments With the current code you cannot create links to external documents

Feb 27 '24 12:02 pubpub-zz

@rsinger417 Could you please check/verify whether you can keep support for external links which do not point to the same document?

Feb 28 '24 14:02 stefan6419846

external link works, the annotation uses "url" instead of "target_page_index" for an internal link such as mylink=Link(rect=[600,600,700,700], border=[0,0,1,[3,2]], url="https://github.com/py-pdf/pypdf/pull/2450") rather than mylink=Link(rect=[130,60,230,25], target_page_index=0)

Feb 28 '24 23:02 rsinger417

external link works, the annotation uses "url" instead of "target_page_index" for an internal link such as mylink=Link(rect=[600,600,700,700], border=[0,0,1,[3,2]], url="#2450") rather than mylink=Link(rect=[130,60,230,25], target_page_index=0)

in PDF there is different links: /Goto which are for link to pages /URI which are links to URI/URL but also: /GotoR which are links to pages in a remote document /GotoE which are links to pages in a document embedded in the document

It is there two links where page index are used with.

Feb 29 '24 15:02 pubpub-zz

I think this PR may also fix issue #2346. Probably it may make sense to check if indeed that issue will also be solved.

Mar 22 '24 21:03 ZupoLlask

I would assume that this is/was some intended functionality. Apparently it was considered a valid use case to generate annotations for invalid pages which might be added later on.

I'd be fine with removing it. The main question is if we need to go through the deprecation process (which would take quite long) or if we can simply say that it was a bug.

I don't see the use-case here, hence I'd say it is a bug - especially when there was this change of behavior in pypdf==2.9.0 (written by rsinger417 before; I didn't check).

Is there anything stopping people from adding the pages first?

Apr 14 '24 15:04 MartinThoma

pypdf pypdf copied to clipboard

BUG: Invalid Link

https://github.com/rsinger417 This message originated from outside your organization

pypdf
pypdf copied to clipboard