miette
Label location and width do not account for character length
Label locations do not correctly account for the actual width of characters. The specific case I tested was ☃, which is treated as being 3 columns wide because it is 3 bytes in UTF-8.
Here is a snippet of output for the issue.
LexerError { src: NamedSource { name: "bad.wrt", source: "<redacted>", span: SourceSpan { offset: SourceOffset(4), length: SourceOffset(3) } }
× The input did not match a token rule
╭─[bad.wrt:1:1]
1 │ abc ☃ abc
· ─┬─
· ╰── This text was not recognized
╰────
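For illustration, the mismatch in the output above comes from using a byte offset and byte length as if they were columns. The following is a minimal sketch using only the Rust standard library; the helper names `byte_offset_to_char_column` and `byte_span_to_char_len` are hypothetical and are not part of miette. It counts Unicode scalar values (`char`s), which fixes the ☃ case but is still not the same thing as terminal display width.

```rust
/// Hypothetical helper (not miette API): convert a byte offset into `line`
/// to a zero-based column counted in Unicode scalar values (`char`s).
/// Returns `None` if the offset is out of range or not on a char boundary.
fn byte_offset_to_char_column(line: &str, byte_offset: usize) -> Option<usize> {
    // `str::get` returns `None` instead of panicking on a bad index.
    line.get(..byte_offset).map(|prefix| prefix.chars().count())
}

/// Hypothetical helper (not miette API): number of `char`s covered by a
/// byte span starting at `byte_offset` with length `byte_len`.
fn byte_span_to_char_len(line: &str, byte_offset: usize, byte_len: usize) -> Option<usize> {
    let end = byte_offset.checked_add(byte_len)?;
    line.get(byte_offset..end).map(|covered| covered.chars().count())
}
```

With the line `abc ☃ abc` and the span from the report (offset 4, length 3), these helpers give a column of 4 and a label length of 1 rather than 3.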
This is a common issue; you need to comply with UAX #29.
This should be possible to fix with the unicode-width crate.
This issue requires an exact count of the user-perceived characters (a.k.a. grapheme clusters); UAX #11 has nothing to do with this issue.
See also http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Yup, and we can probably use something like unicode_segmentation to get that. The other issue is that if terminals display graphemes with different widths (e.g. ö̲ vs a), then grapheme count isn't a perfect representation of horizontal offset.
Is ö̲ a combining character sequence?
@tasogare3710 It does not require an exact count of the user-perceived characters; it requires the exact width of the user-perceived characters in a terminal. Since some characters, such as emojis and East Asian characters, are usually 2 columns wide instead of 1, label locations will still be incorrect if you just count grapheme clusters.
I suggested the unicode-width crate because that's apparently what helix uses.
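To make the width argument concrete, here is a rough sketch of computing a display column from a byte offset. It is an assumption-heavy illustration, not what the unicode-width crate does: `approx_char_width` and `approx_display_column` are hypothetical names, and the code-point ranges are a crude hand-picked approximation (combining diacritics as width 0, some East Asian and emoji blocks as width 2). A real fix would presumably rely on proper Unicode width tables.

```rust
/// Hypothetical, crude approximation of a char's terminal width.
/// The ranges below are illustrative only and far from complete.
fn approx_char_width(c: char) -> usize {
    match c as u32 {
        0 => 0,
        // Combining Diacritical Marks: drawn over the previous character.
        0x0300..=0x036F => 0,
        // A few East Asian wide / fullwidth blocks and common emoji blocks.
        0x1100..=0x115F
        | 0x2E80..=0xA4CF
        | 0xAC00..=0xD7A3
        | 0xF900..=0xFAFF
        | 0xFE30..=0xFE4F
        | 0xFF00..=0xFF60
        | 0xFFE0..=0xFFE6
        | 0x1F300..=0x1F64F
        | 0x1F900..=0x1F9FF
        | 0x20000..=0x3FFFD => 2,
        _ => 1,
    }
}

/// Hypothetical helper: approximate display column of a byte offset,
/// summing per-char widths of everything before it. Returns `None` if
/// the offset is out of range or not on a char boundary.
fn approx_display_column(line: &str, byte_offset: usize) -> Option<usize> {
    line.get(..byte_offset)
        .map(|prefix| prefix.chars().map(approx_char_width).sum::<usize>())
}
```

Under this approximation ☃ counts as 1 column, a CJK character or an emoji such as 😀 counts as 2, and `o` followed by two combining marks counts as 1, which is why neither byte counts nor grapheme counts alone give the right horizontal offset.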