swash
Documentation: what does "code units" mean?
The documentation of text::cluster::Token does not explain what a code unit is. From the example code in the shape module, it seems that the offset property is the index of the character in the text and len is its length when encoded as UTF-8, but is that right?
In my code I don't use UTF-8 strings because I carry extra information. Instead I keep an array of "chars" like this:
(char 'A') (char 'B')(kern -0.5pt)(char '🙃')
I suppose this is three tokens, but what values should one use for offset and len?
offset: 0 len: 'A'.len_utf8()
offset: 1 len: 'B'.len_utf8()
offset: 2 len: '🙃'.len_utf8()
Should the offset of the third token be 2 (logical index into the characters) or 3 (index into my array)?
I assume we can build the tokens from a str and let char_indices compute the offset:
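use swash::text::Codepoint; // provides ch.properties()
use swash::text::cluster::Token;

let source = "AB🙃";
// char_indices yields the byte offset of each char in the UTF-8 string.
let tokens = source.char_indices().map(|(i, ch)| Token {
    ch,
    offset: i as u32,
    len: ch.len_utf8() as u8,
    info: ch.properties().into(),
    data: 0,
});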
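As a quick check of what char_indices actually yields, here is a std-only sketch (no swash types; the tuple layout and the function name offsets_and_lens are just for illustration). It suggests that the offset is a byte index into the UTF-8 string rather than a character index, although for "AB🙃" the two happen to coincide because 'A' and 'B' are one byte each.

```rust
/// Returns (char, byte offset, UTF-8 length) for each char in `source`.
/// Hypothetical helper for illustration; it is not part of swash.
fn offsets_and_lens(source: &str) -> Vec<(char, u32, u8)> {
    source
        .char_indices()
        // `i` is the byte position of `ch` within `source`.
        .map(|(i, ch)| (ch, i as u32, ch.len_utf8() as u8))
        .collect()
}
```

Calling offsets_and_lens("AB🙃") gives offsets 0, 1, 2 and lengths 1, 1, 4. A string such as "é🙃" would make the difference visible: the second char starts at byte offset 2, not 1.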
I use SourceRange like this. Its start and end are defined in code units. You should get the idea.
source[source_range.to_range().start..source_range.to_range().end]
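If code units are indeed UTF-8 bytes, one possible way to handle the item array from the question is to build a str from the char items and record each char's byte offset, so that ranges can slice that str directly. This is an assumption about the intended usage, not something the swash documentation states. Item and text_and_offsets are hypothetical names, and the sketch uses only std.

```rust
/// Hypothetical item type mirroring the array from the question.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Item {
    Char(char),
    Kern(f32), // kerning amount in points
}

/// Builds the UTF-8 text from the char items and returns, for each char,
/// (index into `items`, byte offset into the text, UTF-8 length).
/// Kern items contribute no text, so they do not advance the byte offset.
fn text_and_offsets(items: &[Item]) -> (String, Vec<(usize, u32, u8)>) {
    let mut text = String::new();
    let mut offsets = Vec::new();
    for (index, item) in items.iter().enumerate() {
        if let Item::Char(ch) = *item {
            offsets.push((index, text.len() as u32, ch.len_utf8() as u8));
            text.push(ch);
        }
    }
    (text, offsets)
}
```

Under this assumption the third token would get offset 2 and len 4, and the array index 3 is kept on the side to map a token back to the original item.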