product-backlog
Support PDF annotations that span multiple pages
Updated summary from @robertknight: The Hypothesis client does not allow the user to annotate a selection in a PDF that spans multiple pages.
Steps
- Visit https://www.supremecourt.gov/opinions/12pdf/12-96_6k47.pdf
- Select text across a page boundary, as in this recording: http://jonudell.net/h/pdf-select-across-pages.mp4
This is not so much a bug as a known and long-standing limitation of the PDF anchoring in the Hypothesis client: annotations that span multiple pages are not supported.
User request: https://hypothesis.zendesk.com/agent/tickets/2232
"It'd be great to be able to highlight/annotate between two pages. The system doesn't seem to like highlighting from the end of one page to the beginning of another."
User request: https://hypothesis.zendesk.com/agent/tickets/2967
I normally use Safari on OSX, but the bookmarklet is not an option for annotating local PDFs, so I have been using Chrome, which in general I find clunky, and in particular am frustrated by its apparent inability to create an annotation based on selected text that crosses a PDF page boundary.
Is there a solution to this problem, which for me constitutes a major downside in the use case of locally saved PDFs… ?
Feature request from Twitter:
Hoping that soon, highlights and annotations across a page break will work in @hypothes_is - as of now, it doesn't seem to. I can highlight it correctly, but when I click highlight or annotate, it just flickers. Seems like an important roadmap item (or bug?)
Came up again today - user thought it was a bug: https://hypothesis.zendesk.com/agent/tickets/3594
As Sean Roberts mentioned way back when, it would be nice if we could detect that the selection crosses page boundaries and prevent the user from making an annotation that is guaranteed to orphan.
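A minimal sketch of such a check, not the client's actual code. It assumes each PDF page is wrapped in an element with the class "page" (as in the PDF.js viewer, to the best of my knowledge); the names NodeLike, RangeLike, closestPage and spansMultiplePages are hypothetical. The structural types let the sketch run without a DOM; in a browser, a real Range could be passed instead.

```typescript
// Hypothetical minimal shapes; real DOM Node and Range objects fit them too.
interface NodeLike {
  parentNode: NodeLike | null;
  classList?: { contains(name: string): boolean };
}

interface RangeLike {
  startContainer: NodeLike;
  endContainer: NodeLike;
}

// Walk up from `node` (inclusive) to the nearest element with class "page".
// ASSUMPTION: page containers carry the class "page".
function closestPage(node: NodeLike | null): NodeLike | null {
  for (let n = node; n !== null; n = n.parentNode) {
    if (n.classList && n.classList.contains('page')) {
      return n;
    }
  }
  return null;
}

// True when both ends of the range sit inside page containers and those
// containers differ, i.e. the selection crosses a page boundary.
function spansMultiplePages(range: RangeLike): boolean {
  const startPage = closestPage(range.startContainer);
  const endPage = closestPage(range.endContainer);
  return startPage !== null && endPage !== null && startPage !== endPage;
}
```

A caller could run this before showing the annotate/highlight controls and display a hint instead when it returns true. Because the walk starts at the container itself, it also handles a selection whose end container is an element rather than a text node.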
Came up today. https://app.hubspot.com/contacts/6291320/ticket/309936759/ I think the user's behavior (selecting text over a page break) is definitely something people run into. Also somewhat typical is the user's reaction: "this is broken, I'm not going to try a different part of the text".
User asked about this on Twitter: https://twitter.com/pivic/status/1373895420271812608
Another report of this: https://app.hubspot.com/contacts/6291320/ticket/508528409/
We invested some time there as well and came to the conclusion that document.getSelection().getRangeAt(0) returns the CanvasWrapper as its endContainer instead of a text node. The resolveOffsets function in text-range.js then throws the error 'Offset exceeds text length', because currentNode is null once the text-node filter is applied.
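The failure mode described here can be modelled with a simplified sketch. This is not the code from text-range.js: TreeNode, textNodes and resolveOffset are hypothetical stand-ins, and only the error message is taken from the report. The sketch walks text nodes only, so a container with no text descendants yields nothing to walk and the lookup fails.

```typescript
// Hypothetical tree shape: text nodes carry `text`, elements carry `children`.
interface TreeNode {
  text?: string;
  children?: TreeNode[];
}

// Collect the text nodes under `root` in document order.
function textNodes(root: TreeNode): TreeNode[] {
  if (root.text !== undefined) {
    return [root];
  }
  const found: TreeNode[] = [];
  for (const child of root.children || []) {
    for (const node of textNodes(child)) {
      found.push(node);
    }
  }
  return found;
}

// Map a character offset within `root` to a (text node, local offset) pair.
// Throws when the text nodes run out before the offset is reached.
function resolveOffset(
  root: TreeNode,
  offset: number
): { node: TreeNode; offset: number } {
  let remaining = offset;
  for (const node of textNodes(root)) {
    const length = (node.text as string).length;
    if (remaining <= length) {
      return { node, offset: remaining };
    }
    remaining -= length;
  }
  throw new RangeError('Offset exceeds text length');
}
```

In this model, an element that holds only a canvas has no text nodes, so the loop body never runs and the error is thrown for any offset.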
The problem is bigger than a minor bug as this comment suggests. The code that finds ("anchors") text within PDFs currently assumes that all the text for a single annotation occurs on a single page. This assumption also touches other aspects of how PDF annotations work, e.g. handling of pages that are off-screen and not rendered. Somehow it would need to be reworked to allow parts of the quote to span multiple pages.
Do you have some idea of what approach to take for solving this, and how much work this might take?
No, it hasn't been explored in detail yet. There are some challenges which make it non-trivial. Some notes on these:
- Annotations created on PDFs currently record which part of the document they reference via a text quote and the surrounding text. There are no PDF page numbers or PDF page coordinates in the annotation data. This is a source of challenges in supporting annotations that span multiple pages. See also https://github.com/hypothesis/client/issues/3720.
- The current logic for locating annotations in a document operates on one page of text at a time. This helps to partition the problem into smaller tasks in long documents. Supporting annotations that span multiple pages should not regress performance in long documents.
- A page in a PDF can be either rendered or a placeholder, depending on whether it is in/near the viewport or not. Hypothesis executes different code paths to handle both cases, but currently assumes that an annotation is either on a rendered page or a placeholder page. If an annotation spans multiple pages, there is a possibility that it spans both rendered and non-rendered pages.
- Quite often the text layer of a PDF includes stuff before / after the main body of the document, which would get selected if creating a selection that goes from one page to the next. This might appear as junk in captured quotes. I'm not sure how big of an issue this will be yet.
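To make the second and third notes concrete, here is a rough sketch of one possible building block, not an agreed design. It assumes the quote has already been matched to a character range in the concatenation of all page texts (with no separators between pages), and splits that range into per-page segments, each tagged with whether its page is rendered. PageText, PageSegment and splitRangeByPage are hypothetical names.

```typescript
// Hypothetical description of one page's extracted text.
interface PageText {
  pageIndex: number;
  text: string;
  rendered: boolean; // false for placeholder (off-screen) pages
}

// Part of an annotation's range that falls on a single page.
interface PageSegment {
  pageIndex: number;
  start: number; // offset within the page's own text
  end: number; // exclusive
  rendered: boolean;
}

// Split the global range [start, end) into one segment per page it touches.
// ASSUMPTION: global offsets index the plain concatenation of page texts.
function splitRangeByPage(
  pages: PageText[],
  start: number,
  end: number
): PageSegment[] {
  const segments: PageSegment[] = [];
  let pageStart = 0;
  for (const page of pages) {
    const pageEnd = pageStart + page.text.length;
    const segStart = Math.max(start, pageStart);
    const segEnd = Math.min(end, pageEnd);
    if (segStart < segEnd) {
      segments.push({
        pageIndex: page.pageIndex,
        start: segStart - pageStart,
        end: segEnd - pageStart,
        rendered: page.rendered,
      });
    }
    if (pageEnd >= end) {
      break; // later pages cannot overlap the range
    }
    pageStart = pageEnd;
  }
  return segments;
}
```

Each segment could then go through the existing single-page code path for its page, rendered or placeholder, which would keep the per-page partitioning that helps performance in long documents. This sketch does nothing about header or footer text picked up at the page boundary.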
@mattdricker you had a comment related to a Support ticket in this other issue which was closed in favor of the issue above.
Thanks. Above user ticket here: https://app.hubspot.com/contacts/6291320/ticket/1157100967
Similar incident: https://app.hubspot.com/contacts/6291320/record/0-5/2581904181