
Update PDF.js (2025 edition)

Open robertknight opened this issue 11 months ago • 13 comments

PDF.js was last updated nearly three years ago.

This issue covers updating to the latest version and addressing any compatibility issues we run into.

Issues identified:

  1. Mismatch of Unicode normalization between text returned by text extraction APIs and text layer. There may also be mismatches in normalization with saved annotations' quotes.
  2. Position of highlights is not updated after changing zoom level
  3. Clicking on annotations doesn't select them in the sidebar, although creating a selection that includes an annotation does indicate there is a highlight (something preventing clicks from propagating?)
  4. The supported browser range for the "legacy" build changed. As of May 2025 it requires Safari 16.4 (Mar 2023) and Chrome 110 (Feb 2023)

robertknight avatar Jan 29 '25 16:01 robertknight

From https://github.com/hypothesis/pdf.js-hypothes.is/pull/31#issuecomment-2621975473:

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Intermittently I have seen an error about a getDownloadInfo property referencing code in pdf-metadata.ts in the client. This method still exists upstream, so it looks like we might be trying to access it before app.pdfDocument has been set?

robertknight avatar Jan 29 '25 16:01 robertknight

The most obvious issue is that the position of highlights gets out of sync with the text when zooming in and out. PDF.js has changed the way it updates the DOM after zoom. Previously it would remove and recreate various elements, and that would trigger re-creation of the highlights. Now it looks like it does something cheaper, where a --scale-factor CSS variable is updated and that scales various PDF.js viewer elements. However, it doesn't affect the highlight elements we have drawn.
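
To picture what has to happen to our highlights when the scale changes, here is a minimal sketch. It assumes highlight boxes are stored in CSS pixels relative to the page and that the page scales uniformly from its top-left corner. HighlightRect and rescaleRect are hypothetical names for illustration, not the client's API, and this is not necessarily how the eventual fix works.

```typescript
// Hypothetical shape for a highlight box, in CSS pixels relative to the page.
interface HighlightRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Recompute a rect that was measured at `oldScale` so that it lines up with
// text rendered at `newScale`. Assumes uniform scaling from the top-left corner.
function rescaleRect(
  rect: HighlightRect,
  oldScale: number,
  newScale: number,
): HighlightRect {
  if (oldScale <= 0 || newScale <= 0) {
    throw new RangeError("scale must be positive");
  }
  const ratio = newScale / oldScale;
  return {
    left: rect.left * ratio,
    top: rect.top * ratio,
    width: rect.width * ratio,
    height: rect.height * ratio,
  };
}
```

The alternative is to throw the highlight elements away and redraw them from the anchored ranges whenever the scale changes, which is closer to what the old remove-and-recreate behavior gave us for free.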

Image

robertknight avatar Jan 29 '25 16:01 robertknight

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Digging into this issue, on page 2 of the TraceMonkey paper used in the PDF.js demo I see differences in Unicode normalization between the text extracted from the PDF by our getPageTextContent function and the textContent property of the DOM node from the text layer where the user made a text selection.

In the text layer:

Image

Here "flow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

This may have been caused by https://github.com/mozilla/pdf.js/pull/16200. We have a translateOffsets function that can handle whitespace differences between the text layer and the extracted text. We may also need to handle Unicode normalization differences here.
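
The ligature mismatch described above can be reproduced with plain string normalization. This is only an illustration: it assumes NFKC as a stand-in for whatever normalization PDF.js applies, and the variable names are made up for the example.

```typescript
// U+FB02 is the single-character "fl" ligature, as seen in the text layer.
const textLayerText = "\uFB02ow";

// Assumption: NFKC stands in for the normalization applied to extracted text.
const extractedText = textLayerText.normalize("NFKC");

console.log(textLayerText.length); // 3
console.log(extractedText.length); // 4
console.log(extractedText === "flow"); // true
```

Because the two strings differ in length, any character offset after the ligature is off by one between the two representations.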

robertknight avatar Jan 29 '25 16:01 robertknight

Related to the change in Unicode normalization, there may be some accessibility issues arising from this. See https://github.com/nvaccess/nvda/issues/14740.

robertknight avatar Jan 30 '25 12:01 robertknight

Here "flow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

To summarize the issue for future reference:

Older versions of PDF.js used to apply Unicode normalization to the text coming from both the text extraction API and the hidden text layer in the DOM. The latest version, by default, normalizes the text returned by the text extraction API but not the text in the text layer. The rationale is that this makes the characters in the text layer align better with the visual text. For example, a ligature that is one character visually is now one character in the text layer instead of two.

Hypothesis uses the text extraction API to efficiently gather text from the PDF in order to find matches for saved annotations' quotes. After a match has been found however, we need to highlight the corresponding text in the text layer. If the text layer and extraction API return different text, the position range calculated based on text from the extraction API needs to be translated into a position range within the text layer. Likewise when saving a new annotation, we ideally should translate the position range in the text layer into that of the text returned by the API. There is some tolerance here because the position saved for annotations is only a hint for quote anchoring.

We already deal with the fact that whitespace can be different in the text layer versus the extraction API result. This is handled by translateOffsets as mentioned above.

There are a few approaches we could take:

  1. Request un-normalized text from the text extraction API, so that it matches the text layer. This would mean the extracted text will be different from what was previously extracted, which will have consequences for quote anchoring, although the fuzzy matching does allow some tolerance for this. Saving characters like ligatures in annotation quotes may cause some headaches elsewhere in Hypothesis (e.g. in search), although we do have some normalization that happens.
  2. Expand the existing logic for handling whitespace differences between extracted text and the text layer, so that it can also handle differences in normalization. See here. This is probably the right way to go, but only if it can be done without adding a ton of complexity.
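
As a rough idea of what the second approach involves, here is a minimal sketch of mapping an offset in normalized text back to an offset in the un-normalized text layer. It is not the existing translateOffsets function. toRawOffset is a hypothetical name, NFKC is assumed as a stand-in for PDF.js's normalization, and each character is assumed to normalize independently, which does not hold for combining sequences.

```typescript
// Map an offset in the normalized (extracted) text back to an offset in the
// raw (text layer) text.
//
// Assumptions: NFKC stands in for PDF.js's normalization, and each character
// normalizes independently of its neighbors.
function toRawOffset(raw: string, normalizedOffset: number): number {
  let rawPos = 0;
  let normPos = 0;
  for (const char of raw) {
    const normLen = char.normalize("NFKC").length;
    if (normPos + normLen > normalizedOffset) {
      // The target offset falls at the start of, or inside, the normalized
      // form of this character, so it maps to the start of the character.
      break;
    }
    normPos += normLen;
    rawPos += char.length;
  }
  return rawPos;
}

// "overflow" with an fl ligature: 7 characters raw, 8 once normalized.
const raw = "over\uFB02ow";
console.log(toRawOffset(raw, 4)); // 4 (start of the ligature)
console.log(toRawOffset(raw, 5)); // 4 (inside the ligature)
console.log(toRawOffset(raw, 6)); // 5 (the "o" after the ligature)
```

An offset that lands in the middle of a ligature snaps to the start of it, which is probably acceptable given that saved positions are only a hint for quote anchoring.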

robertknight avatar Jan 30 '25 14:01 robertknight

Context: what work is required to update to a new PDF.js version? This is not about existing problems. Let's make time to do it this year (2025).

mkdir-washington-edu avatar Apr 10 '25 15:04 mkdir-washington-edu

Position of highlights is not updated after changing zoom level

This may be fixed by https://github.com/hypothesis/client/pull/7019.

robertknight avatar Apr 29 '25 11:04 robertknight

  2. Position of highlights is not updated after changing zoom level
  3. Clicking on annotations doesn't select them in the sidebar, although creating a selection that includes an annotation does indicate there is a highlight (something preventing clicks from propagating?)

These issues are now fixed. The issue with zoom level changing was fixed by https://github.com/hypothesis/client/pull/7018. I'm not sure exactly which of the recent changes fixed selection in the sidebar, so that needs probing.

The supported browser range for the "legacy" build changed. As of May 2025 it requires Safari 16.4 (Mar 2023) and Chrome 110 (Feb 2023)

This means the oldest supported browsers are ~2 years old. That's a limitation I think we can live with, although it might cause issues on some older iOS or Android devices.

A new issue that has appeared is that thumbnail rendering doesn't work in the latest PDF.js version. This feature expands the surface area of PDF.js APIs that we interact with somewhat, so this is something to be mindful of.

robertknight avatar May 05 '25 09:05 robertknight

Thank you for this, up-to-date PDF.js will be awesome

EugeneIstomin avatar May 06 '25 06:05 EugeneIstomin

A new issue that has appeared is that thumbnail rendering doesn't work in the latest PDF.js version. This feature expands the surface area of PDF.js APIs that we interact with somewhat, so this is something to be mindful of.

After further testing, it seems this issue only affects Firefox. I can use image annotations on https://mozilla.github.io/pdf.js/web/viewer.html in Safari and Chrome, and thumbnails appear in the sidebar. In Firefox, I can verify that an ImageBitmap is produced in the guest frame for the thumbnail, but it appears to get lost in transmission when sent to the host frame via MessagePort.postMessage. The sidebar never receives the bitmap and the AnnotationThumbnail component displays "Loading thumbnail..." forever.

Curiously, rendering thumbnails in the older version of PDF.js we use does work in Firefox.

Potentially related issues: https://bugzilla.mozilla.org/buglist.cgi?quicksearch=ImageBitmap

  • https://bugzilla.mozilla.org/show_bug.cgi?id=1575501 ("Support ImageBitmap in all postMessage() APIs"). This says "In particular, we don't support it for BroadcastChannel and MessageChannel", and yet thumbnails do work in older PDF.js versions.
  • https://bugzilla.mozilla.org/show_bug.cgi?id=1775392 ("structuredClone throws DomException when called with ImageBitmap derived from OffscreenCanvas")
  • https://bugzilla.mozilla.org/show_bug.cgi?id=1565205 ("Ensure ImageBitmap object whose origin-clean flag is false cannot enter COEP process")

The workaround for Firefox might involve generating Blobs from the OffscreenCanvas instead or extracting pixel data from the canvas and sending that.
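
A minimal sketch of the pixel data option, to show roughly what would be sent instead of an ImageBitmap. PixelSource mirrors the width, height and data fields of the browser's ImageData (as returned by getImageData on a 2D canvas context). ThumbnailPixels and toThumbnailPixels are hypothetical names, not the client's actual message protocol.

```typescript
// Structural subset of the browser's ImageData, so that this sketch does not
// depend on DOM typings.
interface PixelSource {
  width: number;
  height: number;
  data: Uint8ClampedArray; // RGBA, 4 bytes per pixel
}

// Hypothetical message shape for sending a thumbnail between frames.
interface ThumbnailPixels {
  width: number;
  height: number;
  rgba: Uint8ClampedArray;
}

function toThumbnailPixels(image: PixelSource): ThumbnailPixels {
  if (image.data.length !== image.width * image.height * 4) {
    throw new RangeError("pixel data does not match dimensions");
  }
  return {
    width: image.width,
    height: image.height,
    // Copy the pixels so the message owns its buffer. `rgba.buffer` could
    // then be listed in the postMessage transfer list to avoid a second copy.
    rgba: new Uint8ClampedArray(image.data),
  };
}
```

On the receiving side the sidebar could rebuild an ImageData from rgba, width and height and draw it to a canvas. The trade-off is that raw RGBA is uncompressed, so a Blob may be the better choice if thumbnails are large.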

robertknight avatar May 06 '25 09:05 robertknight

The supported browser range for the "legacy" build changed. As of May 2025 it requires Safari 16.4 (Mar 2023) and Chrome 110 (Feb 2023)

The minimum browser versions listed on that page are:

  • Chrome 110
  • Safari 16.4
  • Firefox ESR. This is a moving target, so I'm going to take Firefox 115 which is the ESR of a similar vintage to Chrome.

Analyzing LMS access logs from the past day, following the steps at https://github.com/hypothesis/user-agent-analysis, I get:

python3 analyze_stats.py logs.csv "chrome>=110,safari>=16,firefox>=115"
92512 rows, 87641 valid (94.7%), 4871 skipped
98.91% of rows match query

So this means that ~1.1% of LMS queries were from browsers older than this baseline. Normally the cutoff we use when updating our browser baseline is ~1%.
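
For reference, the check behind that query and the resulting percentage can be sketched as follows. This is an illustration only, not how analyze_stats.py is implemented; parseBaseline and meetsBaseline are hypothetical names, and browsers missing from the query are treated as not matching.

```typescript
// Parse a query such as "chrome>=110,safari>=16,firefox>=115" into a map of
// browser name to minimum major version.
function parseBaseline(query: string): Map<string, number> {
  const baseline = new Map<string, number>();
  for (const clause of query.split(",")) {
    const [browser, version] = clause.split(">=");
    const major = Number(version);
    if (!browser || !Number.isInteger(major)) {
      throw new Error(`invalid clause: ${clause}`);
    }
    baseline.set(browser.trim().toLowerCase(), major);
  }
  return baseline;
}

// A browser meets the baseline if it is listed and its major version is at
// least the minimum. Unlisted browsers are treated as not matching.
function meetsBaseline(
  baseline: Map<string, number>,
  browser: string,
  major: number,
): boolean {
  const min = baseline.get(browser.toLowerCase());
  return min !== undefined && major >= min;
}

const baseline = parseBaseline("chrome>=110,safari>=16,firefox>=115");
console.log(meetsBaseline(baseline, "Chrome", 109)); // false
console.log(meetsBaseline(baseline, "Firefox", 115)); // true

// Share of valid rows below the baseline, from the output above.
console.log((100 - 98.91).toFixed(2)); // "1.09"
```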

robertknight avatar May 15 '25 09:05 robertknight

Summarizing the status of the issues that were identified:

  1. Mismatch of Unicode normalization between text returned by text extraction APIs and text layer. There may also be mismatches in normalization with saved annotations' quotes.

Will be fixed by https://github.com/hypothesis/client/pull/7096.

  2. Position of highlights is not updated after changing zoom level

Fixed by https://github.com/hypothesis/client/pull/7018.

  3. Clicking on annotations doesn't select them in the sidebar, although creating a selection that includes an annotation does indicate there is a highlight (something preventing clicks from propagating?)

This was fixed at some point during the work on image annotations. TBC which change resolved the problem.

  4. The supported browser range for the "legacy" build changed. As of May 2025 it requires Safari 16.4 (Mar 2023) and Chrome 110 (Feb 2023)

Will be fixed by https://github.com/hypothesis/client/pull/7080.

robertknight avatar May 19 '25 15:05 robertknight

The Unicode normalization issue was solved by https://github.com/hypothesis/client/pull/7123. At this point Hypothesis is compatible with the current versions of PDF.js. When we come to update the viewer in Via, there is an issue that the new version has built-in annotation tools, which might cause confusion for e.g. LMS users because they are not integrated with Hypothesis. I think we'll need to modify the viewer UI so we can remove those.

robertknight avatar Jun 04 '25 10:06 robertknight