cursorless
cursorless copied to clipboard
Solidify our Unicode support
The problem
Cursorless has some support for Unicode, but it could be improved. The core of the problem is that we don't properly determine grapheme boundaries, so our tokeniser and hat placement will sometimes split individual graphemes, resulting in problems such as
- emoji being split in half by hat renderer (eg 🤷♂️ being split into 🤷♂),
- deleting half of a surrogate pair, which would result in invalid Unicode in a document
- etc
We need to properly handle Unicode in the following areas:
- tokeniser,
- grapheme splitter,
- word scope see https://github.com/cursorless-dev/cursorless/blob/main/src/libs/cursorless-engine/scopeHandlers/WordScopeHandler/WordTokenizer.ts and https://github.com/cursorless-dev/cursorless/blob/2e5ba054e75e010c5a5a06a740a4cc786fb81521/src/libs/cursorless-engine/tokenizer/tokenizer.ts#L89
- identifier scope see https://github.com/cursorless-dev/cursorless/blob/2e5ba054e75e010c5a5a06a740a4cc786fb81521/src/libs/cursorless-engine/tokenizer/tokenizer.ts#L88
- token scope (relies on tokenizer)
- character scope (relies on grapheme splitter)
The solution
We'd like to vendor in VSCode's grapheme splitter, which appears to be a complete implementation of Unicode's recommended grapheme cluster splitting algorithm, optimised for use in a code editor. We will use this directly for grapheme splitting. For our other, regex-based segmentation algorithms, we will modify them to automatically extend until the nearest cluster boundary after they do regex splitting.
- [ ] Look into https://github.com/flmnt/graphemer as possible replacement for VSCode splitter? Should do some benchmarking, as the VSCode implementation seems well optimised for code, where the common case is mostly ASCII
Steps
1. Copy grapheme splitter from VSCode
We begin by copying VSCode's grapheme splitter into our vendor directory (note that we could prob reuse the copy of CharCode that we've already vendored in if necessary). See https://github.com/microsoft/vscode/blob/9bc5f9b708c36a38ba7e5d41e291c1f244af5972/src/vs/editor/common/core/cursorColumns.ts#L44 for example usage of their splitter.
We will probably want to wrap their splitter in something more ergonomic, eg something that returns an iterable of ranges:
function generateGraphemesInRange(editor: TextEditor, range: Range, direction: Direction): Iterable<RangeOffsets>;
2. Create generateGraphemeAwareMatchesInRange function
The file will look roughly as follows:
import { Range, TextEditor } from "@cursorless/common";
import { RangeOffsets } from "../../../typings/updateSelections";
import { imap } from "itertools";
interface RangeWithGraphemes {
offsets: RangeOffsets;
graphemes: RangeOffsets[];
}
function* generateGraphemeAwareMatchesInRange(
regex: Regexp,
editor: TextEditor,
range: Range,
direction: Direction,
): Iterable<RangeWithGraphemes> {
// Instantiate grapheme iterator for range
const graphemeIterator = generateGraphemesInRange(editor, range, direction)[
Symbol.iterator
]();
let grapheme = graphemeIterator.next();
const matchIterable = generateMatchOffsetsInRange(
regex,
editor,
range,
direction,
);
for (const match of matchIterable) {
// For each match returned by regex:
// Discard graphemes until we find the first that overlaps with match
while (!grapheme.done && grapheme.value.end <= match.start) {
grapheme = graphemeIterator.next();
}
// Collect all graphemes that overlap with match
const graphemesInMatch: RangeOffsets[] = [];
while (!grapheme.done && grapheme.value.start < match.end) {
graphemesInMatch.push(grapheme.value);
grapheme = graphemeIterator.next();
}
if (graphemesInMatch.length === 0) {
throw new Error("No graphemes in match; this should never happen");
}
// Expand match to include all graphemes, and return the graphemes for
// convenience
yield {
offsets: {
start: graphemesInMatch[0].start,
end: graphemesInMatch[graphemesInMatch.length - 1].end,
},
graphemes: graphemesInMatch,
};
}
}
type Direction = "forward" | "backward";
function generateMatchOffsetsInRange(
regex: RegExp,
editor: TextEditor,
range: Range,
direction: Direction,
): Iterable<RangeOffsets> {
const offset = editor.document.offsetAt(range.start);
const text = editor.document.getText(range);
const matchToRange = (match: RegExpMatchArray): RangeOffsets => ({
start: offset + match.index!,
end: offset + match.index! + match[0].length,
});
// Reset the regex to start at the beginning of string, in case the regex has
// been used before.
// See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#finding_successive_matches
regex.lastIndex = 0;
return direction === "forward"
? imap(text.matchAll(regex), matchToRange)
: Array.from(text.matchAll(regex), matchToRange).reverse();
}
function generateGraphemesInRange(
editor: TextEditor,
range: Range,
direction: Direction,
): Iterable<RangeOffsets> {
throw new Error("Not implemented; we need to vendor this function in from VSCode source code");
}
3. Use generateGraphemeAwareMatchesInRange everywhere we need it
- [ ] Switch tokeniser to use
generateGraphemeAwareMatchesInRangecombined with our existing token regex - [ ] Switch existing grapheme splitter to just perform normalization; using graphemes already output by
generateGraphemeAwareMatchesInRangeinstead of doing its own grapheme splitting. This will require changes toallocateHats, which coordinates the interaction between tokeniser and grapheme splitter. We to probably rename existing grapheme splitter tonormalizeGraphemesor something - [ ] Switch word scope to use
generateGraphemeAwareMatchesInRangecombined with our existing word regex - [ ] Switch identifier scope to use
generateGraphemeAwareMatchesInRangecombined with our existing identifier regex - [ ] Switch token scope to use
generateGraphemeAwareMatchesInRangecombined with our existing token regex - [ ] Switch character scope to use
generateGraphemeAwareMatchesInRange - [ ] Disable regex token expansion (see eg https://github.com/cursorless-dev/cursorless/blob/e17628ac6f77516ccd83052099ec52a80ad0ed17/src/core/updateSelections/getOffsetsForEmptyRangeInsert.ts#L68 and https://github.com/cursorless-dev/cursorless/blob/e17628ac6f77516ccd83052099ec52a80ad0ed17/src/core/updateSelections/getOffsetsForNonEmptyRangeInsert.ts#L82), as it will require work to get it to be grapheme aware, and it's not even clear if the regex expansion behaviour was ever desirable. We should just switch tokens to be
closed
4. Add tests
We should add tests for the following:
- [ ] 🤷♂️ (zero-width joiner)
- [ ] 👍🏽 (Fitzpatrick modifier)
- [ ] ✏️ (variation selector)
- [ ] 💩 (surrogate pair)
- [ ]
\u0027\u005a\u0351\u036b\u0343\u036a\u0302\u036b\u033d\u034f\u0334\u0319\u0324\u031e\u0349\u035a\u032f\u031e\u0320\u034d\u0041\u036b\u0357\u0334\u0362\u0335\u031c\u0330\u0354\u004c\u0368\u0367\u0369\u0358\u0320\u005f\u0047\u0311\u0357\u030e\u0305\u035b\u0341\u0334\u033b\u0348\u034d\u0354\u0339\u004f\u0342\u030c\u030c\u0358\u0328\u0335\u0339\u033b\u031d\u0333\u005f\u0061\u0069\u0072\u0021\u033f\u030b\u0365\u0365\u0302\u0363\u0310\u0301\u0301\u035e\u035c\u0356\u032c\u0330\u0319\u0317\u0061\u0027
- [ ] Steal test cases from lodash
- [ ]
\u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9}
(from https://mathiasbynens.be/notes/javascript-unicode#poo-test)
- [ ] Try to find VSCode Unicode tests
For each of the above strings, we should run:
- [ ] Tokeniser
- [ ] Grapheme splitter
- [ ] Word scope type
- [ ] Character scope type
- [ ] Token scope type
- [ ] Identifier scope type
Caveats
Technically speaking, we could end up splitting a token that we don't want to, if our token regex isn't smart enough to understand some construct that should actually be considered part of the middle of a token. Including \p{M} in our regex, as we do today, should solve almost every problem, and we can always improve our regex if necessary, so this should be a non-issue. In addition, the failure mode here is much less dramatic than today's; as we won't be corrupting graphemes by eg splitting surrogate pairs. We'll just end up with shorter tokens than we'd like in these rare cases
References
- https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/
- https://mathiasbynens.be/notes/javascript-unicode