client icon indicating copy to clipboard operation
client copied to clipboard

Error when saving if 32-char prefix/suffix of quote selector ends mid-way through a unicode surrogate pair

Open robertknight opened this issue 3 years ago • 2 comments

Steps to reproduce:

  1. On https://github.com/typescript-eslint/typescript-eslint/discussions/6014, activate the extension and try to annotate exactly the text "no-inferrable-types" in the left column of a table
  2. Try to save the annotation

Expected result: Annotation is saved Actual result: 500 server error from h

There are two issues here:

  1. From inspecting the client's store (set window.debug = true) we can see that the TextQuoteSelector of the new annotation contains a string that ends mid-way through a UTF-16 surrogate pair (see the suffix field):

    TextQuoteSelector invalid unicode
  2. In h, the POST /api/annotations API fails when attempting to store the JSON blob in Postgres: https://sentry.io/organizations/hypothesis/issues/2551504044/?project=37293&query=is%3Aunresolved&referrer=issue-stream

InvalidTextRepresentation: invalid input syntax for type json
LINE 1: ...cript-eslint/typescript-eslint/discussions/6014', '[{"type":...
                                                             ^
DETAIL:  Unicode low surrogate must follow a high surrogate.
CONTEXT:  JSON data, line 1: ...ud83e\uddf1\n\ud83d\udfe9\n\n\n\n\n\n", "suffix":...

There are two fixes needed here:

  1. The client shouldn't be submitting selectors with invalid Unicode to the server
  2. h shouldn't crash with an internal server error. It should either fail with a 4xx error, or silently fix up the invalid Unicode (eg. by ignoring the isolated Unicode high surrogate)

A workaround for users is to change the text selection slightly so that the prefix and suffix end at a slightly different point.

robertknight avatar Nov 30 '22 07:11 robertknight

We have some code in https://github.com/hypothesis/client/blob/631372eefbda552476ebb438205231e1f25ad97e/src/sidebar/util/unicode.ts#L27 that shows how to do Unicode-aware truncation of strings.

robertknight avatar Sep 07 '23 13:09 robertknight

This type of payloads are now rejected by the backend.

We seem to have a few hundred instandtance os this problem every month (900 last 30 days) so it is worth fixing.

marcospri avatar Feb 20 '24 08:02 marcospri