Solidify our Unicode support

Open pokey opened this issue 2 years ago • 0 comments

The problem

Cursorless has some support for Unicode, but it could be improved. The core of the problem is that we don't properly determine grapheme boundaries, so our tokeniser and hat placement will sometimes split individual graphemes, resulting in problems such as

emoji being split in half by hat renderer (eg 🤷‍♂️ being split into 🤷♂),
deleting half of a surrogate pair, which would result in invalid Unicode in a document
etc

We need to properly handle Unicode in the following areas:

tokeniser,
grapheme splitter,
word scope see https://github.com/cursorless-dev/cursorless/blob/main/src/libs/cursorless-engine/scopeHandlers/WordScopeHandler/WordTokenizer.ts and https://github.com/cursorless-dev/cursorless/blob/2e5ba054e75e010c5a5a06a740a4cc786fb81521/src/libs/cursorless-engine/tokenizer/tokenizer.ts#L89
identifier scope see https://github.com/cursorless-dev/cursorless/blob/2e5ba054e75e010c5a5a06a740a4cc786fb81521/src/libs/cursorless-engine/tokenizer/tokenizer.ts#L88
token scope (relies on tokenizer)
character scope (relies on grapheme splitter)

The solution

We'd like to vendor in VSCode's grapheme splitter, which appears to be a complete implementation of Unicode's recommended grapheme cluster splitting algorithm, optimised for use in a code editor. We will use this directly for grapheme splitting. For our other, regex-based segmentation algorithms, we will modify them to automatically extend until the nearest cluster boundary after they do regex splitting.

[ ] Look into https://github.com/flmnt/graphemer as possible replacement for VSCode splitter? Should do some benchmarking, as the VSCode implementation seems well optimised for code, where the common case is mostly ASCII

Steps

1. Copy grapheme splitter from VSCode

We begin by copying VSCode's grapheme splitter into our vendor directory (note that we could prob reuse the copy of CharCode that we've already vendored in if necessary). See https://github.com/microsoft/vscode/blob/9bc5f9b708c36a38ba7e5d41e291c1f244af5972/src/vs/editor/common/core/cursorColumns.ts#L44 for example usage of their splitter.

We will probably want to wrap their splitter in something more ergonomic, eg something that returns an iterable of ranges:

function generateGraphemesInRange(editor: TextEditor, range: Range, direction: Direction): Iterable<RangeOffsets>;

2. Create `generateGraphemeAwareMatchesInRange` function

The file will look roughly as follows:

import { Range, TextEditor } from "@cursorless/common";
import { RangeOffsets } from "../../../typings/updateSelections";
import { imap } from "itertools";

interface RangeWithGraphemes {
  offsets: RangeOffsets;
  graphemes: RangeOffsets[];
}

function* generateGraphemeAwareMatchesInRange(
  regex: Regexp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeWithGraphemes> {
  // Instantiate grapheme iterator for range
  const graphemeIterator = generateGraphemesInRange(editor, range, direction)[
    Symbol.iterator
  ]();
  let grapheme = graphemeIterator.next();

  const matchIterable = generateMatchOffsetsInRange(
    regex,
    editor,
    range,
    direction,
  );
  for (const match of matchIterable) {
    // For each match returned by regex:

    // Discard graphemes until we find the first that overlaps with match
    while (!grapheme.done && grapheme.value.end <= match.start) {
      grapheme = graphemeIterator.next();
    }

    // Collect all graphemes that overlap with match
    const graphemesInMatch: RangeOffsets[] = [];
    while (!grapheme.done && grapheme.value.start < match.end) {
      graphemesInMatch.push(grapheme.value);
      grapheme = graphemeIterator.next();
    }

    if (graphemesInMatch.length === 0) {
      throw new Error("No graphemes in match; this should never happen");
    }

    // Expand match to include all graphemes, and return the graphemes for
    // convenience
    yield {
      offsets: {
        start: graphemesInMatch[0].start,
        end: graphemesInMatch[graphemesInMatch.length - 1].end,
      },
      graphemes: graphemesInMatch,
    };
  }
}

type Direction = "forward" | "backward";

function generateMatchOffsetsInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  const offset = editor.document.offsetAt(range.start);
  const text = editor.document.getText(range);

  const matchToRange = (match: RegExpMatchArray): RangeOffsets => ({
    start: offset + match.index!,
    end: offset + match.index! + match[0].length,
  });

  // Reset the regex to start at the beginning of string, in case the regex has
  // been used before.
  // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#finding_successive_matches
  regex.lastIndex = 0;

  return direction === "forward"
    ? imap(text.matchAll(regex), matchToRange)
    : Array.from(text.matchAll(regex), matchToRange).reverse();
}

function generateGraphemesInRange(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  throw new Error("Not implemented; we need to vendor this function in from VSCode source code");
}

3. Use `generateGraphemeAwareMatchesInRange` everywhere we need it

[ ] Switch tokeniser to use generateGraphemeAwareMatchesInRange combined with our existing token regex
[ ] Switch existing grapheme splitter to just perform normalization; using graphemes already output by generateGraphemeAwareMatchesInRange instead of doing its own grapheme splitting. This will require changes to allocateHats, which coordinates the interaction between tokeniser and grapheme splitter. We to probably rename existing grapheme splitter to normalizeGraphemes or something
[ ] Switch word scope to use generateGraphemeAwareMatchesInRange combined with our existing word regex
[ ] Switch identifier scope to use generateGraphemeAwareMatchesInRange combined with our existing identifier regex
[ ] Switch token scope to use generateGraphemeAwareMatchesInRange combined with our existing token regex
[ ] Switch character scope to use generateGraphemeAwareMatchesInRange
[ ] Disable regex token expansion (see eg https://github.com/cursorless-dev/cursorless/blob/e17628ac6f77516ccd83052099ec52a80ad0ed17/src/core/updateSelections/getOffsetsForEmptyRangeInsert.ts#L68 and https://github.com/cursorless-dev/cursorless/blob/e17628ac6f77516ccd83052099ec52a80ad0ed17/src/core/updateSelections/getOffsetsForNonEmptyRangeInsert.ts#L82), as it will require work to get it to be grapheme aware, and it's not even clear if the regex expansion behaviour was ever desirable. We should just switch tokens to be closed

4. Add tests

We should add tests for the following:

[ ] 🤷‍♂️ (zero-width joiner)
[ ] 👍🏽 (Fitzpatrick modifier)
[ ] ✏️ (variation selector)
[ ] 💩 (surrogate pair)
[ ] \u0027\u005a\u0351\u036b\u0343\u036a\u0302\u036b\u033d\u034f\u0334\u0319\u0324\u031e\u0349\u035a\u032f\u031e\u0320\u034d\u0041\u036b\u0357\u0334\u0362\u0335\u031c\u0330\u0354\u004c\u0368\u0367\u0369\u0358\u0320\u005f\u0047\u0311\u0357\u030e\u0305\u035b\u0341\u0334\u033b\u0348\u034d\u0354\u0339\u004f\u0342\u030c\u030c\u0358\u0328\u0335\u0339\u033b\u031d\u0333\u005f\u0061\u0069\u0072\u0021\u033f\u030b\u0365\u0365\u0302\u0363\u0310\u0301\u0301\u035e\u035c\u0356\u032c\u0330\u0319\u0317\u0061\u0027

[ ] Steal test cases from lodash
[ ] \u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9}

(from https://mathiasbynens.be/notes/javascript-unicode#poo-test)

[ ] Try to find VSCode Unicode tests

For each of the above strings, we should run:

[ ] Tokeniser
[ ] Grapheme splitter
[ ] Word scope type
[ ] Character scope type
[ ] Token scope type
[ ] Identifier scope type

Caveats

Technically speaking, we could end up splitting a token that we don't want to, if our token regex isn't smart enough to understand some construct that should actually be considered part of the middle of a token. Including \p{M} in our regex, as we do today, should solve almost every problem, and we can always improve our regex if necessary, so this should be a non-issue. In addition, the failure mode here is much less dramatic than today's; as we won't be corrupting graphemes by eg splitting surrogate pairs. We'll just end up with shorter tokens than we'd like in these rare cases

References

https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/
https://mathiasbynens.be/notes/javascript-unicode

Feb 10 '23 15:02 pokey

cursorless cursorless copied to clipboard

Solidify our Unicode support

The problem

The solution

Steps

1. Copy grapheme splitter from VSCode

2. Create generateGraphemeAwareMatchesInRange function

3. Use generateGraphemeAwareMatchesInRange everywhere we need it

4. Add tests

Caveats

References

cursorless
cursorless copied to clipboard

2. Create `generateGraphemeAwareMatchesInRange` function

3. Use `generateGraphemeAwareMatchesInRange` everywhere we need it