feat(parser): utf16 spans
💡 UTF-8 to UTF-16 conversion is costly for downstream tools. Is it doable to offer UTF-16 spans from our parser?
I don't think this would be too difficult to implement in the lexer, while it's running through the source character by character anyway.
But is the idea that spans always specify positions as UTF-16? Or that spans specify both UTF-8 and UTF-16 positions? Or one or the other, depending on user choice?
As far as I understand, the source map "standard" states positions must be UTF16. But I have no idea what actual tools in the real world do.
By the way, if you operate on source as a slice of `u8` bytes, rather than `char`s, calculating UTF-16 length can be quite fast. Here's some WIP code I wrote a while back while hacking on SWC, which could probably be massaged a bit to take advantage of SIMD:
```rust
/// Get length of string as number of UTF-16 characters.
///
/// The implementation below is equivalent to
/// `s.chars().map(|c| c.len_utf16()).sum::<usize>()`
/// but hopefully faster.
///
/// It relies on the translation of UTF-8 coding to UTF-16:
///
/// - 1-byte UTF-8 sequence
///   = 1st byte `0xxxxxxx`
///   -> 1 x UTF-16 char
///   UTF-16 len = UTF-8 len
/// - 2-byte UTF-8 sequence
///   = 1st byte `110xxxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF-16 char
///   UTF-16 len = UTF-8 len - 1
/// - 3-byte UTF-8 sequence
///   = 1st byte `1110xxxx`, remaining bytes `10xxxxxx`
///   -> 1 x UTF-16 char
///   UTF-16 len = UTF-8 len - 2
/// - 4-byte UTF-8 sequence
///   = 1st byte `11110xxx`, remaining bytes `10xxxxxx`
///   -> 2 x UTF-16 chars (surrogate pair)
///   UTF-16 len = UTF-8 len - 2
///
/// So UTF-16 len =
///   UTF-8 len
///   minus count of UTF-8 bytes indicating start of a sequence 2 bytes or longer (`>= 0xC0`)
///   minus count of UTF-8 bytes indicating start of a sequence 3 bytes or longer (`>= 0xE0`)
///
/// See: https://stackoverflow.com/questions/5728045/c-most-efficient-way-to-determine-how-many-bytes-will-be-needed-for-a-utf-16-st
fn utf16_len(s: &str) -> usize {
    s.len()
        - s.bytes()
            .map(|c| (c >= 0xC0) as usize + (c >= 0xE0) as usize)
            .sum::<usize>()
}
```
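For example, a quick sanity check covering each sequence length (illustrative values of my own, exercising `utf16_len` above):

```rust
fn main() {
    assert_eq!(utf16_len("abc"), 3); // ASCII: 1 UTF-16 unit per byte
    assert_eq!(utf16_len("é"), 1);   // 2-byte UTF-8 -> 1 UTF-16 unit
    assert_eq!(utf16_len("€"), 1);   // 3-byte UTF-8 -> 1 UTF-16 unit
    assert_eq!(utf16_len("😀"), 2);  // 4-byte UTF-8 -> surrogate pair
}
```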
> But is the idea that spans always specify positions as UTF-16? Or that spans specify both UTF-8 and UTF-16 positions? Or one or the other, depending on user choice?
This is going to be a hard one.
Here is the actual problem (sorry, I should have stated the actual problem instead of doing an X/Y problem).
When span values are used on the JS side, they need to be converted to UTF-16 spans, otherwise Unicode will blow up everything.
https://github.com/oxc-project/oxc/blob/1f3f21d722d22f97f20d9d2b47050dc86e185660/website/playground/index.js#L706-L708
In source map generation, we need to keep track of line and column indices instead of byte offsets, all in UTF-16.
So what I'm actually looking for is an O(1) solution for UTF-8 -> UTF-16 conversion, and it seemed trivial to get the values from the lexer.
By the way, the tendril library has a great summary around these topics (along with the other Atom issue): https://github.com/servo/tendril
But I don't have the time to go down the rabbit hole :-( For example, wtf is WTF-8 lol
Ah. Interesting.
I think the 2 problems, although related, can be tackled separately.
Span positions (`start` and `end`)

Should `Span`'s current `start` and `end` be changed from UTF-8 positions to UTF-16 positions? i.e. Is it just a mistake that they're currently UTF-8 positions, and it should be rectified to make them UTF-16?

If so, that could be done in the lexer, and that'd solve problem no. 1. That could be implemented now, without finding a solution to part 2.

But does anything/anyone depend on `start` and `end` being UTF-8 positions? (e.g. to slice up the source text in Rust.)

If absolutely necessary, both could be supported, with a parser option determining whether positions are recorded as UTF-16 or UTF-8.
Source-map friendly positions
Line+column positions could also be calculated quite easily for every token in the lexer, and then added to `Span`.
Getting the line/column for any AST node would then be free.
However, storing all the info in each AST node would increase the size of `Span` from 8 to 24 bytes:
```rust
struct LineColumn { line: u32, column: u32 }

struct Span {
    start: u32,
    end: u32,
    start_line_column: LineColumn,
    end_line_column: LineColumn,
}
```
That's not ideal, as every node has a `Span`, so it's a lot of memory in aggregate.
If it's acceptable for getting line + column to rely on data outside of the AST node, I can see two ways to reduce the size of `Span`:
Solution 1
```rust
struct Span { start: u32, end: u32, start_line: u32, end_line: u32 }
```
Lexer would also fill a `Vec` recording the start position (relative to start of file) for each line. e.g. if the first line of source is 20 characters, followed by `\n`, then `line_positions[0] = 0` & `line_positions[1] = 21`.
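Building that `Vec` could look something like this (a minimal sketch, hypothetical helper, not the actual lexer):

```rust
/// Record the byte offset of each line start while scanning the source.
fn line_start_positions(source: &str) -> Vec<u32> {
    let mut line_positions = vec![0u32];
    for (i, byte) in source.bytes().enumerate() {
        if byte == b'\n' {
            // Next line starts at the byte after the `\n`.
            line_positions.push(i as u32 + 1);
        }
    }
    line_positions
}
```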
So then to get line + column for a `Span`:
```rust
fn get_columns(span: &Span, line_positions: &[u32]) -> (u32, u32) {
    let start_col = span.start - line_positions[span.start_line as usize];
    let end_col = span.end - line_positions[span.end_line as usize];
    (start_col, end_col)
}
```
That's O(1), and `Span` is 16 bytes.
Solution 2
```rust
struct Span { start: u32, end: u32 }
```
Lexer also records a `Vec` of line start positions (as before).
Transforming `start` to line + column would involve:

- Binary search of `line_positions` to find the line with the latest start position which is `<= start`.
- `column = start - line_start`

This is not O(1), but binary search is O(log n), which is much better than O(n). `Span` stays 8 bytes.
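A sketch of that lookup (hypothetical helper; assumes `line_positions` is sorted ascending and starts with 0):

```rust
fn line_column(offset: u32, line_positions: &[u32]) -> (u32, u32) {
    // Binary search: number of line starts <= offset, minus 1, is the line.
    let line = line_positions.partition_point(|&start| start <= offset) - 1;
    let column = offset - line_positions[line];
    (line as u32, column)
}
```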
There's also scope for an optimization: Very often an AST node will have a source position close to its preceding/enclosing node, so when traversing an entire AST to generate a source map, there could be a fast path for "this node is on same line as previous node". So maybe it'd be quite efficient.
I'm not sure which is preferable - O(1), or minimizing the size of `Span`. There may well also be better solutions I've not thought of.
Looks like SWC is doing something like Solution 2: https://github.com/swc-project/swc/blob/b76dd4635d02b1fc76a6f799d4bc70710025f889/crates/swc_common/src/syntax_pos.rs#L1440-L1445
WTF
I started reading into WTF-8 (!). I can't claim to have got my head around it entirely. But at first glance, I don't think it helps with the source maps line/column problem and, while it might help with UTF16 positions, there are much simpler solutions to hand for that, since the lexer scans source text from start to finish in order anyway.
Sorry, I've written you an essay again. If you're short on time, the important question is this one:
Should `Span`'s current `start` and `end` be changed from UTF-8 positions to UTF-16 positions?
i.e. Is it just a mistake that they're currently UTF-8 positions, and it should be rectified to make them UTF-16?
> Sorry, I've written you an essay again. If you're short on time, the important question is this one:
>
> Should `Span`'s current `start` and `end` be changed from UTF-8 positions to UTF-16 positions? i.e. Is it just a mistake that they're currently UTF-8 positions, and it should be rectified to make them UTF-16?
I don't think so - everything in Rust land assumes UTF-8.
For example when we emit diagnostic code frames, miette will slice into the source text by the spans we provide.
> Sorry, I've written you an essay again.
I love these essays and discussions, keep up the good work!
I just found another use case for the line / column problem
https://github.com/oxc-project/oxc/blob/19577709dbcfc9761cfcb1b39ec7f3ebdc810285/crates/oxc_language_server/src/linter.rs#L392-L399
These problems are low-hanging fruit, but we don't need to tackle them right now.
It's just good to know these problems exist: we code in Rust (UTF-8), but these values will slip through to JavaScript (UTF-16), which causes a conversion overhead.
> I love these essays and discussions, keep up the good work!
Ha. OK, great!
> these values will slip through to JavaScript (UTF-16), which causes a conversion overhead.
This is actually the nub of my interest. I'd like to find a way to reduce the overhead of passing ASTs from Rust to JavaScript. I tried to solve that problem on SWC, but OXC is actually much more suited to a very performant solution due to its use of arenas. Anyway, I'll open an issue about that at some point.
This will soon be the only blocker for being able to use OXC as a parser for TS. I hope there is a way to do this in a performant way!
> This will soon be the only blocker for being able to use OXC as a parser for TS. I hope there is a way to do this in a performant way!
@ArnaudBarre Is Prettier getting the span positions from these two getters?
```ts
export const parsers: Plugin["parsers"] = {
  typescript: {
    astFormat: "estree",
    parse: (text, options) => oxcParse(text, options.filepath),
    locStart: (node) => node.start, // <---
    locEnd: (node) => node.end, // <---
    hasPragma: () => false,
  },
};
```
Or do we need to rewrite all AST node spans?
If it's only getting them from `locStart` and `locEnd`, is it possible for us to provide a slightly faster lookup data structure than these?
https://github.com/oxc-project/oxc/blob/1f3f21d722d22f97f20d9d2b47050dc86e185660/website/playground/index.js#L139-L139
https://github.com/oxc-project/oxc/blob/1f3f21d722d22f97f20d9d2b47050dc86e185660/website/playground/index.js#L706-L708
Or maybe we can add some kind of caching mechanism to the `TextDecoder` method.
All I want is to get things running first, even with the slowest `TextDecoder` approach 😅
Yeah, we can do the remapping in these two methods; it will work, but it will probably kill the perf! But I agree, let's first have something working and then profile! I can try a small method that keeps track of the already-computed UTF-8/UTF-16 mapping, to avoid passing over the whole file each time. I will see the perf impact of both and report back here, probably tonight!
As these ASTs are coming from the parser (not transformed code which could move things around), you'd expect while traversing the tree, if the order of operations for each node is:

- Convert `span.start`
- Visit node's children
- Convert `span.end`

...then each offset will be larger than the last offset processed. So you'd only need to `TextDecoder`-ize the bytes between the current offset and the previous offset.
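Sketched in Rust (hypothetical names, assuming offsets arrive in ascending order and on char boundaries; a JS version would do the same with `TextDecoder` over the byte range):

```rust
/// Convert UTF-8 offsets to UTF-16 offsets incrementally.
struct IncrementalConverter<'a> {
    source: &'a str,
    last_utf8: usize,
    last_utf16: usize,
}

impl<'a> IncrementalConverter<'a> {
    fn new(source: &'a str) -> Self {
        Self { source, last_utf8: 0, last_utf16: 0 }
    }

    /// `utf8_offset` must be >= the previous offset passed in.
    fn convert(&mut self, utf8_offset: usize) -> usize {
        // Only measure the slice between the previous offset and this one.
        let slice = &self.source[self.last_utf8..utf8_offset];
        self.last_utf16 += slice.chars().map(char::len_utf16).sum::<usize>();
        self.last_utf8 = utf8_offset;
        self.last_utf16
    }
}
```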
However, I think `new TextDecoder().decode()` involves a call across the JS/native boundary. If so, that'll have a sizeable cost in aggregate over 1000s of calls, regardless of how many bytes you decode each time.
Longer term, could we look at doing this in the Lexer?
2 possible options:
Lookup table
```rust
trait PosConversion {}

struct NoPosConversion;
impl PosConversion for NoPosConversion {}

struct Utf16PosConversion(Vec<(u32, u32)>);
impl PosConversion for Utf16PosConversion {}

struct Lexer<P: PosConversion> {
    pos_lookup_table: P,
    // ...
}
```
If `P` is `Utf16PosConversion`, lexer methods would populate the table.
(We could also add a `LineColumnPosConversion` for a sourcemap-style lookup table.)
Building the table in the Lexer wouldn't be too costly, as it's already reading through the source and checking for Unicode bytes. Translating spans would be a binary search (so reasonably fast), but would require traversing the AST in full, either in JS or in Rust.
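For example, translating one offset via such a table might look like this (a hypothetical sketch, assuming the table stores ascending `(utf8_offset, utf16_offset)` pairs recorded after each non-ASCII character, so the two encodings advance in lockstep between entries):

```rust
fn utf8_to_utf16(table: &[(u32, u32)], utf8_offset: u32) -> u32 {
    // Binary search for the last recorded pair at or before this offset.
    let i = table.partition_point(|&(utf8, _)| utf8 <= utf8_offset);
    if i == 0 {
        // No non-ASCII characters before this point: offsets are identical.
        utf8_offset
    } else {
        // Only ASCII between the recorded pair and the requested offset,
        // so both encodings advance 1 unit per byte from there.
        let (utf8, utf16) = table[i - 1];
        utf16 + (utf8_offset - utf8)
    }
}
```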
Span type
```rust
enum SpanType {
    Utf8,
    Utf16,
    LineColumn,
}

struct Lexer<const SPAN_TYPE: SpanType> {
    // ...
}
```
This is more direct. If `SPAN_TYPE == SpanType::Utf16`, `span.start` and `span.end` would be UTF-16 offsets. This would be faster in the Lexer, as there's no need to build a table, and zero cost on the JS side (no need to traverse).
This would work perfectly for the use case of getting UTF-16 spans in a JS AST, but maybe less easy to integrate with other parts of OXC.
I've been struggling with the question of how to deal with this for a while. To summarize my thoughts:

- The "Span type" solution above is definitely going to be most performant, and in a lot of ways the simplest.
- The OXC parser (and probably other parts of OXC) currently rely on `Span` offsets being UTF-8 to slice the source text. But I believe we can find other ways to fulfil that function in OXC, without using `Span`s for that. So then `Span` offsets could be UTF-16 (or anything else!) with no problem.
- External consumers of OXC in Rust may also want UTF-8 offsets.
@Boshen you've mentioned the 3rd point a few times, but the thing I'm struggling with is understanding for what purpose external consumers need this. It would really help if you could give some examples.
The TextEncoder thing is working. I didn't yet measure the perf impact; as soon as it was on, I started getting new interesting diffs 😅. Sadly the current one involves a difference of AST for `export import foo = bar` between TSESLint 5 and 6. This is not the first time I've been frustrated by astexplorer lagging behind on parser versions, so I'll take my next evenings to build one (I need this also to quickly compare the ASTs of TSESLint & TS, and to go back to working on my TS linter 😅)
> The TextEncoder thing is working. I didn't yet measure the perf impact; as soon as it was on, I started getting new interesting diffs 😅. Sadly the current one involves a difference of AST for `export import foo = bar` between TSESLint 5 and 6. This is not the first time I've been frustrated by astexplorer lagging behind on parser versions, so I'll take my next evenings to build one
Try this https://github.com/sxzz/ast-explorer at https://ast.sxzz.moe/
> External consumers of OXC in Rust may also want UTF-8 offsets.
The linter is our external consumer, although it lives inside this repository. There are also people using the AST to build their own tools.
Regarding building an external data structure vs building the thing inside the lexer - I'm drawing inspiration from how browsers do it:
https://github.com/servo/tendril?tab=readme-ov-file#utf-16-compatibility

> Solution: Use WTF-8 in parsing and in the DOM. Servo will convert to contiguous UTF-16 when necessary. The conversion can easily be parallelized, if we find a practical need to convert huge chunks of text all at once.

Parallelized? Huge chunks of text all at once? What does all that mean?

Anyway, I think it's a lot cleaner to build the external infrastructure first and see its performance impact - it may not be that bad ...
Thanks for coming back, Boshen. More questions (sorry!):

> The linter is our external consumer, although it lives inside this repository. There are also people using the AST to build their own tools.
Understood. But my question is: what purpose do these external tools require UTF-8 spans for? So, for example, in the linter:

- In what circumstances does it need to use `span.start` / `span.end`, and need them to be UTF-8 offsets?
- Is this purely for extracting slices of source text? Or are there other uses?
- How common are such operations?
- How "hot" are those paths?
- Maybe what the linter actually needs is neither exactly? (Maybe for displaying lint errors, it really needs the UTF-8 offset of the line start + a UTF-8 column offset?)
And:

- What other parts of OXC require UTF-8 offsets on a hot path? e.g. in the transformer I think it'd only be for error messages (i.e. not hot)?
- Can you think of any use cases for UTF-8 offsets apart from slicing source text?
The context for these questions is that I'm wondering along these lines:

- There are common needs for UTF-16 offsets, e.g. source maps.
- There are clearly some needs for UTF-8 offsets (but I'm not sure what they are).
- If the former is more common/hot than the latter, maybe we should flip it on its head and:
  - `Span` contains UTF-16 offsets.
  - Provide a mechanism (e.g. lookup table) to convert UTF-16 offsets to UTF-8.
- Or, if each use case requires only one of UTF-8 or UTF-16, we could satisfy both very performantly with a parameterized `Lexer<SpanType>`, and the consumer chooses which they want.
> building an external data structure vs building the thing inside the lexer
If by "external data structure" you mean e.g. a UTF-8 -> UTF-16 lookup table, I don't see these 2 as mutually exclusive. We could build the external data structure in the lexer.
> I'm drawing inspiration from how browsers do it
Servo's approach is interesting. Another example of prior art is V8, which I understand has quite a complex implementation for strings, with multiple internal representations (UTF-8/UTF-16, and contiguous/non-contiguous chunks stored as a tree). It uses whichever representation is most performant depending on how the string originated, and then translates from one representation to another internally, depending on what the code using the string does. e.g. the tree representation is good for lots of `str += more` operations, while the contiguous UTF-16 representation is best for `str[n]`.
However, I'm not sure how similar a browser's use case is to OXC's. What OXC needs to do with strings/offsets may be quite different (or maybe not). You've turned me on to the notion of data-driven design, hence all my questions about what uses `Span` offsets are being put to.
> Anyway, I think it's a lot cleaner to build an external infrastructure first and see its performance impact, it may not be that bad ...
If you're willing, I'd like to try to clarify requirements as much as possible before implementation. Whichever way we go, implementation won't be entirely trivial, so in my view it'd be preferable for 1st iteration to be hopefully close to what we need in the end.
Boshen did an experiment to see what the performance hit of adding UTF-16 offset fields to `Span` would be: #2626
The result was a 2% - 3% regression on parser benchmarks. Not as bad as expected (well, my expectation anyway).
Boshen asked there:
> Does it work if we add offsets that are smaller than u32, and do `utf16_start = start + offset`, `utf16_end = end + offset` instead?
Nice idea. Maybe. The theoretical maximum for `offset` is `u32::MAX * 2 / 3` (if the input was a 4 GiB JS file where every character is a 3-byte UTF-8 char). That'd be a pretty weird JS file, but legal according to the spec. So handling that would still need a `u32`.
We could try some kind of compression:

- Use a smaller type for UTF-8/UTF-16 offset fields (e.g. 2 x `u16`), with `u16::MAX` as a sentinel value which says "look up the actual value in a table" for weird cases.

or:

- Limit source text size to `u32::MAX / 2` (2 GiB), which would free up the highest bit in `start` and `end`. Use those high bits as flags for "UTF-16 offset is different from UTF-8 offset - look it up in table" (see the sketch below). Then `Span` stays the same size, and for JS files which are pure ASCII (probably a majority), the table would be empty.
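A minimal sketch of that second scheme (hypothetical names; assuming the table maps UTF-8 offsets to UTF-16 offsets):

```rust
use std::collections::HashMap;

/// Top bit of `start`/`end` is free once source size is capped at 2 GiB.
const DIFFERS: u32 = 1 << 31;

fn to_utf16(offset: u32, table: &HashMap<u32, u32>) -> u32 {
    if offset & DIFFERS == 0 {
        // Flag clear: UTF-16 offset is identical to the UTF-8 offset
        // (true for any span preceded only by ASCII).
        offset
    } else {
        // Flag set: look up the real UTF-16 offset in the external table,
        // keyed by the UTF-8 offset with the flag bit removed.
        table[&(offset & !DIFFERS)]
    }
}
```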
However, I don't know if the extra operations and branches to compress and uncompress the values would hurt performance more than the simpler solution of just adding more fields to `Span`.
> What purpose do these external tools require UTF-8 spans for?
Current usages so far are just slicing text for diagnostics, which are all on the cold path.
> Current usages so far are just slicing text for diagnostics, which are all on the cold path.
Thanks for coming back. Right, that does inform this a lot. Do you envisage the same is true for other external consumers?
> Do you envisage the same is true for other external consumers?
I can't think of anything else besides source maps 🤔
OK great, this simplifies things a lot. I have an idea. Can't write it up now, but will try to later in the week.
I have a proposal for how to tackle this. Please feel free to disagree with any of this, but I thought it'd be useful at this stage to make a concrete proposal, as a basis for discussion.
Starting points
Primary use cases for different offset types:
- UTF-16 offsets are needed for JS AST (ESTree compat).
- UTF-8 offsets are needed for slicing source text in Rust for diagnostics.
- Line + column (column as UTF-16) needed for source maps.
Observations:

- Source maps are a core feature of JS transpilation/minification, and creating them will be a hot path in `oxc_codegen`.
- JS AST is (in my view) an important target.
- Diagnostics are a cold path.
- In some use cases, spans aren't needed at all (e.g. if an app is just parsing files in order to find what other files they `import`, it doesn't care where in the file the `import` statements are).
- Calculating sourcemap-style line+column from UTF-16 offsets is a bit quicker than converting from UTF-8, but still requires significant extra work (building a table and then searching it).
Proposal for eventual solution
I propose a flexible solution which prioritizes performance of the main use cases which are on hot paths.
- Cater for all use cases by parameterizing `Lexer` and `ParserImpl` as `Lexer<SpanType>` and `ParserImpl<SpanType>`.
- Consumer asks the parser for offsets to be UTF-8, UTF-16, or line+column, as they require.
- `Span` contains only one offset type - you have to choose.
- The public `Parser` API would not need to be generic (i.e. `Parser` not `Parser<SpanType>`). User would call e.g. `parser::utf16_spans()` and then `Parser::parse` would create a `ParserImpl<SpanType::Utf16>` internally (see the sketch after this list).
- Requested offsets are calculated in the lexer, and the parser stores them in `Span`s.
- Consumer only pays for the conversions they need.
- Provide an interface to lazily convert UTF-16 offsets or line+column back to UTF-8. This can be used for diagnostics. Zero cost unless it's used.
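A rough sketch of how that non-generic public API could dispatch (hypothetical names and structure, not a final design):

```rust
// The offset type is chosen at runtime on the public API, and `parse`
// dispatches to a monomorphized `ParserImpl` internally.
pub struct Parser {
    utf16_spans: bool,
    // ... source text, options, etc.
}

impl Parser {
    /// Request UTF-16 span offsets instead of the default UTF-8.
    pub fn utf16_spans(mut self) -> Self {
        self.utf16_spans = true;
        self
    }

    pub fn parse(self) {
        if self.utf16_spans {
            // ... create a ParserImpl with the UTF-16 span type and run it
        } else {
            // ... create a ParserImpl with the UTF-8 span type and run it
        }
    }
}
```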
Encoding offsets
`Span` remains 2 x `u32`s, regardless of offset type. Any of the following can be encoded as a single `u32`:

- UTF-8 offset (plain `u32`)
- UTF-16 offset (plain `u32`)
- Line + column pair (compressed encoding packed into a `u32`)
Line + column encoding scheme
Line + column encoded as one of these 32-bit bit patterns:

- `00LLLLLLLLLLLLLLLLLLLLCCCCCCCCCC`
  - `L` = Line (20 bits = 0 - 1048575 range)
  - `C` = Column (10 bits = 0 - 1023 range)
- `01CCCCCCCCCCCCCCCCCCCCCCCCLLLLLL`
  - `L` = Line (6 bits = 0 - 63 range)
  - `C` = Column (24 bits = 0 - 16777215 range)
- `1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII`
  - Values to be found in an external lookup table at index `I` (31 bits)
This representation allows internal encoding of all possible line + column pairs for any source which is either:

- Less than 1 million lines, with no lines longer than 1023 chars - almost all "normal" JS sources.
- Less than 64 lines, with no lines longer than 16 million chars - minified files where the entire file is a single line, up to file size 16 MiB.

These 2 should cover almost all real world cases. Any very weird cases (e.g. a few very long lines in a "normal" JS source) would be handled by storing line+column as a pair of `u32`s in an external table.
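For illustration, the encoding side could look something like this (a hypothetical sketch of the scheme above, with the lookup table passed in explicitly):

```rust
/// Pick the densest pattern that fits, falling back to the lookup table.
fn encode(line: u32, column: u32, lookup_table: &mut Vec<(u32, u32)>) -> u32 {
    if line < (1 << 20) && column < (1 << 10) {
        // `00` pattern: 20-bit line, 10-bit column
        (line << 10) | column
    } else if line < (1 << 6) && column < (1 << 24) {
        // `01` pattern: 24-bit column, 6-bit line
        0x40000000 | (column << 6) | line
    } else {
        // `1` pattern: store the pair externally, encode its table index
        lookup_table.push((line, column));
        0x80000000 | (lookup_table.len() - 1) as u32
    }
}
```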
Decoding should be cheap - just a single branch and a few bitwise operations for the most common case:
```rust
let (line, column) = match encoded & 0xC0000000 {
    0 => (encoded >> 10, encoded & 0x3FF),
    0x40000000 => (encoded & 0x3F, (encoded >> 6) & 0xFFFFFF),
    _ => lookup_table[(encoded & 0x7FFFFFFF) as usize],
};
```
(and branch predictor will learn quickly which way to branch as it goes through a file)
Implications
- `Span` is always 2 x `u32`s for any of the above offset types.
- No need to parameterize all AST node types to support `Program<SpanType>` (horrible!).
- Consumer can choose whichever offset type they need for their use case, and they only pay for the conversions they need.
- The fastest place to calculate UTF-16 offsets or line+column is going to be the lexer, where it's already spinning through the entire source byte by byte, and branching on Unicode characters and line breaks.
- Consuming offsets / line+column pairs is then free or very cheap.
- In particular this will be (I think) as good as we can get on speed for source maps. I believe it'll be a significant speed-up from what we have now.
Short term solution
The only hard part of the above to implement is line+column pairs. So in the short term we could just support UTF-8 and UTF-16 spans, and leave line+column support until later.
Another option is to increase the size of `Span` to 16 bytes and include both UTF-8 and UTF-16 offsets in it. As Boshen's experiment showed, the size increase is not enormously costly (though the UTF-16 translation will also add a further cost). Personally I prefer the `Lexer<SpanType>` solution, but Boshen you may disagree.
Any thoughts?
After going back and forth with all the requirements and solutions, I propose that we don't over-engineer this, keep it simple and accept the performance regression.
The tasks will be broken down into:

- Add the UTF-16 spans to the current `Span` and make Prettier work.
- Build a line + column external data structure within the lexer, and remove the line/column generator from codegen.
Removing "P-high" label as this is not a current area of work.