product-backlog Support PDF annotations that span multiple pages

Updated summary from @robertknight: The Hypothesis client does not allow the user to annotate a selection in a PDF that spans multiple pages.

Steps

Visit https://www.supremecourt.gov/opinions/12pdf/12-96_6k47.pdf
This: http://jonudell.net/h/pdf-select-across-pages.mp4

Dec 10 '16 04:12 judell

This is not so much a bug as a known and long-standing limitation of the PDF anchoring in the Hypothesis client: annotations that span multiple pages are not supported.

Dec 12 '16 10:12 robertknight

User request: https://hypothesis.zendesk.com/agent/tickets/2232

"It'd be great to be able to highlight/annotate between two pages. The system doesn't seem to like highlighting from the end of one page to the beginning of another."

Feb 22 '18 17:02 klemay

User request: https://hypothesis.zendesk.com/agent/tickets/2967

I normally use Safari on OSX, but the bookmarklet is not an option for annotating local PDFs, so I have been using Chrome, which in general I find clunky, and in particular am frustrated by its apparent inability to create an annotation based on selected text that crosses a PDF page boundary.

Is there a solution to this problem, which for me constitutes a major downside in the use case of locally saved PDFs… ?

Sep 03 '18 16:09 klemay

Feature request from Twitter:

Hoping that soon, highlights and annotations across a page break will work in @hypothes_is - as of now, it doesn't seem to. I can highlight it correctly, but when I click highlight or annotate, it just flickers. Seems like an important roadmap item (or bug?)

Nov 26 '18 19:11 klemay

Came up again today - user thought it was a bug: https://hypothesis.zendesk.com/agent/tickets/3594

Jan 10 '19 19:01 klemay

As Sean Roberts mentioned way back when, it would be nice if we could detect that the selection crosses page boundaries and prevent the user from making an annotation that is guaranteed to orphan.
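
A minimal sketch of what such a check might look like, written in TypeScript. It assumes that each PDF page is rendered inside a container element carrying a `page` class (as in the PDF.js viewer markup); the names `crossesPageBoundary` and `closestPage` are hypothetical and not part of the Hypothesis client. The structural types are only there so the sketch runs outside a browser; real DOM nodes and ranges should satisfy them.

```typescript
// Minimal structural stand-ins for DOM Node and Range (assumption: only
// these members are needed for the check).
interface NodeLike {
  parentNode: NodeLike | null;
  classList?: { contains(name: string): boolean };
}

interface RangeLike {
  startContainer: NodeLike;
  endContainer: NodeLike;
}

// Walk up from `node` to the nearest ancestor (or self) that has the page
// class. The class name "page" is an assumption about the viewer markup.
function closestPage(node: NodeLike | null, pageClass = 'page'): NodeLike | null {
  for (let n = node; n !== null; n = n.parentNode) {
    if (n.classList?.contains(pageClass)) {
      return n;
    }
  }
  return null;
}

// Returns true when the two ends of the range do not sit in the same page
// container, including the case where one end lies outside any page.
function crossesPageBoundary(range: RangeLike): boolean {
  const startPage = closestPage(range.startContainer);
  const endPage = closestPage(range.endContainer);
  if (startPage === null && endPage === null) {
    return false; // Selection is not inside any PDF page.
  }
  return startPage !== endPage;
}
```

In a browser this could be called with the current selection's first range before the annotate/highlight controls are shown, so that the user gets a message instead of an annotation that fails to anchor.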

Feb 08 '19 18:02 judell

Came up today: https://app.hubspot.com/contacts/6291320/ticket/309936759/ I think the user's behavior (selecting text over a page break) is definitely possible to run into. Also somewhat typical is the user's reaction: "this is broken, I'm not going to try a different part of the text".

Feb 26 '21 21:02 mkdir-washington-edu

User asked about this in Twitter: https://twitter.com/pivic/status/1373895420271812608

Mar 24 '21 16:03 mattdricker

Another report of this: https://app.hubspot.com/contacts/6291320/ticket/508528409/

Aug 05 '21 12:08 mkdir-washington-edu

We invested some time there as well and came to the conclusion that the browser function document.getSelection().getRangeAt(0) returns as an endContainer the CanvasWrapper instead of a text node. The function resolveOffsets in text-range.js then throws the error 'Offset exceeds text length', because currentNode is null once the text-node filter is applied: the CanvasWrapper contains no text nodes to iterate over.
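
To illustrate the failure mode, here is a simplified, DOM-free model in TypeScript. It is not the client's `resolveOffsets`; the function `resolveOffset` and the `TextChunk` type are hypothetical stand-ins, with each chunk playing the role of one text node found under the container. The only detail taken from the report above is the error message.

```typescript
// One entry per text node found under the container (hypothetical model).
interface TextChunk {
  text: string;
}

interface ResolvedOffset {
  chunkIndex: number; // Which text chunk the offset falls in.
  offset: number; // Offset within that chunk.
}

// Map a character offset within the container's text to a position inside
// one of its text chunks. Throws when the chunks run out before the offset
// is reached, which is always the case when there are no text chunks.
function resolveOffset(chunks: TextChunk[], offset: number): ResolvedOffset {
  let consumed = 0;
  for (let i = 0; i < chunks.length; i++) {
    const length = chunks[i].text.length;
    if (offset <= consumed + length) {
      return { chunkIndex: i, offset: offset - consumed };
    }
    consumed += length;
  }
  throw new RangeError('Offset exceeds text length');
}
```

In this model a container without any text (such as a canvas wrapper) corresponds to an empty `chunks` array, so every offset, including 0, ends in the `RangeError`.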

Sep 08 '22 11:09 dennis-zyska

We invested some time there as well and came to the conclusion that the browser function document.getSelection().getRangeAt(0) returns as an endContainer the CanvasWrapper instead of a text node.

The problem is bigger than the minor bug this comment suggests. The code that finds ("anchors") text within PDFs currently assumes that all the text for a single annotation occurs on a single page. This assumption also touches other aspects of how PDF annotations work, e.g. handling of pages that are off-screen and not rendered. Somehow it would need to be reworked to allow parts of the quote to span multiple pages.

Sep 08 '22 11:09 robertknight

Do you have some idea of what approach to take for solving this, and how much work this might take?

Oct 05 '22 08:10 ar-jan

Do you have some idea of what approach to take for solving this, and how much work this might take?

No, it hasn't been explored in detail yet. There are some challenges which make it non-trivial. Some notes on these:

Annotations created on PDFs currently record which part of the document they reference via a text quote and the surrounding text. There are no PDF page numbers or PDF page coordinates in the annotation data. This is a source of challenges in supporting annotations that span multiple pages. See also https://github.com/hypothesis/client/issues/3720.
The current logic for locating annotations in a document operates on one page of text at a time. This helps to partition the problem into smaller tasks in long documents. Supporting annotations that span multiple pages should not regress performance in long documents.
A page in a PDF can be either rendered or a placeholder, depending on whether it is in/near the viewport or not. Hypothesis executes different code paths to handle both cases, but currently assumes that an annotation is either on a rendered page or a placeholder page. If an annotation spans multiple pages, there is a possibility that it spans both rendered and non-rendered pages.
Quite often the text layer of a PDF includes stuff before / after the main body of the document, which would get selected if creating a selection that goes from one page to the next. This might appear as junk in captured quotes. I'm not sure how big of an issue this will be yet.
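
As a rough illustration of the first two notes, the following TypeScript sketch locates a quote in per-page text and splits the match into per-page segments. It is a naive sketch under several assumptions, not the client's anchoring logic: it uses exact matching (no surrounding context or fuzzy search), joins page texts without a separator, and searches the whole document for brevity, where a real implementation would presumably only look at neighbouring pages to avoid regressing performance. The names `findQuoteAcrossPages` and `PageSegment` are hypothetical.

```typescript
// The part of a match that falls on one page, as offsets into that page's text.
interface PageSegment {
  pageIndex: number;
  start: number;
  end: number;
}

// Find the first exact occurrence of `quote` in the concatenated page texts
// and return one segment per page that the match touches, or null if the
// quote is empty or not found.
function findQuoteAcrossPages(pages: string[], quote: string): PageSegment[] | null {
  if (quote.length === 0) {
    return null;
  }
  const matchStart = pages.join('').indexOf(quote);
  if (matchStart === -1) {
    return null;
  }
  const matchEnd = matchStart + quote.length;

  const segments: PageSegment[] = [];
  let pageStart = 0; // Offset of the current page within the joined text.
  pages.forEach((text, pageIndex) => {
    const pageEnd = pageStart + text.length;
    // Clip the match to this page's extent.
    const start = Math.max(matchStart, pageStart);
    const end = Math.min(matchEnd, pageEnd);
    if (start < end) {
      segments.push({ pageIndex, start: start - pageStart, end: end - pageStart });
    }
    pageStart = pageEnd;
  });
  return segments;
}
```

A match that yields more than one segment is exactly the multi-page case discussed here; each segment could then be handled by whichever code path applies to its page (rendered or placeholder).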

Oct 05 '22 10:10 robertknight

@mattdricker you had a comment related to a Support ticket in this other issue which was closed in favor of the issue above.

Dec 07 '22 15:12 mkdir-washington-edu

Thanks. Above user ticket here: https://app.hubspot.com/contacts/6291320/ticket/1157100967

Dec 07 '22 15:12 mattdricker

Similar incident: https://app.hubspot.com/contacts/6291320/record/0-5/2581904181

Apr 22 '24 19:04 janraev
