Positions are captured / anchored in terms of code units rather than code points
According to the specification, the selection (for quote and position selectors) should be expressed in code points rather than code units.
The positions currently recorded and anchored via fromRange and toRange are indexes into a JS string built by iterating over the textContent property of the nodes before and spanned by the selection. Since JS strings use UTF-16 and are indexed by code unit, I believe this means that the generated positions will only be correct, as per the spec, if every character in the page can be represented by a single UTF-16 code unit, i.e. the page contains only BMP characters.
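To illustrate the difference, here is a minimal sketch in plain JavaScript (it does not use this library's code; the sample string and variable names are made up for illustration). It shows that string length and indexes count UTF-16 code units, while iterating the string counts code points:

```javascript
// Illustrative sample: one non-BMP emoji followed by one BMP letter.
const sample = "🤔a";

// .length and indexOf() work in UTF-16 code units.
// The emoji is a surrogate pair, so it takes two units.
const codeUnitLength = sample.length;

// Array.from() uses the string iterator, which walks by code point,
// so the surrogate pair is counted once.
const codePointLength = Array.from(sample).length;

// Offset of "a" in code units vs. code points.
const unitOffsetOfA = sample.indexOf("a");
const pointOffsetOfA = Array.from(sample).indexOf("a");

console.log({ codeUnitLength, codePointLength, unitOffsetOfA, pointOffsetOfA });
```

For text that contains only BMP characters, the two counts agree, which is why the discrepancy only shows up on pages with characters such as emoji.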
Fixing this would be a breaking change for any targets which contain non-BMP text, so I would suggest that if we do decide to do so, it should be a major version bump.
Steps to reproduce using Hypothesis (which uses dom-anchor-text-position to capture position selectors):
- Create a page with this HTML:
<head>
<meta charset="utf-8">
</head>
<body>
🤔🤔🤔🤔:thinking
</body>
- Open the Hypothesis client and select the text ":thinking" above
- Click the "Annotate" button
- Save the annotation
Expected: Hypothesis client POSTs a new annotation with a text position selector with (start, end) offsets of (5, 14), since there are 5 code points (one newline plus 4 emoji) before the start of the selection.
Actual: Text position offsets are (9, 18), because each emoji is a surrogate pair and so counts as two UTF-16 code units (1 + 4 × 2 = 9).
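The numbers above can be checked with a short sketch. `codeUnitToCodePointOffset` is a hypothetical helper, not part of dom-anchor-text-position, and `bodyText` is an assumption about what the concatenated textContent looks like for the test page (one newline, four emoji, then the selected text):

```javascript
// Hypothetical helper: convert a UTF-16 code unit offset into a code point offset.
// Assumes unitOffset does not fall in the middle of a surrogate pair.
function codeUnitToCodePointOffset(text, unitOffset) {
  // Array.from() iterates by code point, so each surrogate pair counts once.
  return Array.from(text.slice(0, unitOffset)).length;
}

// Assumed text content of the repro page.
const bodyText = "\n🤔🤔🤔🤔:thinking\n";
const selected = ":thinking";

// Offsets as currently recorded (code units).
const startUnits = bodyText.indexOf(selected);
const endUnits = startUnits + selected.length;

// Offsets as the spec describes them (code points).
const startPoints = codeUnitToCodePointOffset(bodyText, startUnits);
const endPoints = codeUnitToCodePointOffset(bodyText, endUnits);

console.log([startUnits, endUnits], [startPoints, endPoints]);
```

Converting at capture time like this is only one possible approach. Anchoring would need the inverse mapping, and annotations already stored with code unit offsets would need to be handled separately.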
See also tilgovi/dom-seek#1