regular_expression: report character span offsets not byte offsets
see body of https://github.com/oxc-project/oxc/pull/6129
@leaysgur Maybe you have some insight on this. I was having trouble correctly parsing control characters because the span offsets didn't match the source when unicode escapes were involved. I'm not certain how to fix the issue though.
Yes, I think I know what is going on here. I will write down the details with a little verification when the weekend is over.
TL;DR
This is an issue related to "escape sequences" in JavaScript string literals.
- When using the
RegExpconstructor instead of/regex/literals - And if an escape sequence is included in the string literal
The corresponding position might not be correctly retrieved from the parsed Span.
Simplified
If there is a "file" containing:
"O\"X\"C"
and you want to slice each O, X, and C, you need to calculate the slice position like this.
fn main() {
let source_text = std::fs::read_to_string("./test.js").unwrap();
assert_eq!("O", &source_text[1..2]);
// &source_text[2..3]; \ 👈🏻
// &source_text[3..4]; "
assert_eq!("X", &source_text[4..5]);
// &source_text[5..6]; \ 👈🏻
// &source_text[6..7]; "
assert_eq!("C", &source_text[7..8]);
}
You need to include \ as part of the slicing index.
However, if you're just handling the "string":
fn main() {
let source_text = "O\"X\"C";
assert_eq!("O", &source_text[0..1]);
// &source_text[1..2]; "
assert_eq!("X", &source_text[2..3]);
// &source_text[3..4]; "
assert_eq!("C", &source_text[4..5]);
}
There's no need to consider the \.
This difference is causing problems when a diagnostic tries to slice the file.
Current status
As an implementation detail, oxc_regular_expression is not aware of these escape sequences.
In most cases, the current code works with an AST with oxc_parser's StringLiteral and its value is passed to oxc_regular_expression.
At this point from oxc_regular_expression, /O"X"C/ and new RegExp("O\"X\"C") seems equivalent, and new RegExp('O"X"C') is also the same. (Both "\"'`" and '"\'`' are normalized as "\"'`" in StringLiteral.value.)
Thus, the regular expression can be parsed without needing to be aware of these differences.
However, when attempting to slice the original code string using the parsed Span, issues may arise.
As seen in the original PR:
new RegExp('\\u{1111}*\\x1F', 'u')
^ ^
These 2 \ cause a misalignment as a result.
At that time, I was aware of this issue but was only focused on parsing.
https://github.com/oxc-project/oxc/pull/3824#issuecomment-2291127478 https://github.com/oxc-project/oxc/pull/5106
I was satisfied with oxc_parser resolving the escapes.
How to fix
In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.
- When parsing a regular expression from the
RegExpconstructor, usestring_literal.span.source_text(&self.source_text)instead ofstring_literal.value- This is the entire string, including quotes
- Then pass
inner_text, quote_type, span_offset?
- In
oxc_regular_expression:- Introduce a lexer layer in front of the current
Readerimplementation - Tokenize the text based on the quote type, similar to
oxc_parser/lexer- https://github.com/oxc-project/oxc/blob/21b08ba14119774ea3421802bee210d3f7a35694/crates/oxc_parser/src/lexer/string.rs#L155
- Should consider template literals surrounded by
`too?- https://github.com/oxc-project/oxc/blob/21b08ba14119774ea3421802bee210d3f7a35694/crates/oxc_parser/src/lexer/template.rs#L17
- Provide a mode toggle to use the lexer or not?
- When parsing a regex literal, lexing is unnecessary
- Also unnecessary when using as a standalone regex parser
- Introduce a lexer layer in front of the current
Are there any other approaches to consider...?
In my view, the only solution is to update
oxc_regular_expressionto be aware of escape sequences.
This makes sense to me, I think it would it also align us with the eslint-community/regexpp parser package. Looking at the examples there, it seems like the spans represent the source text, not just its actual value. They also include the raw string for reference, though I don't think we need to do that.
https://github.com/eslint-community/regexpp/blob/4d4787a2479b5c7f4984227c40123749898fb8a2/test/fixtures/parser/literal/basic-valid-2015-u.json#L6133-L6176
- Should consider template literals surrounded by
`too?
I don't think we need to handle this necessarily. The ESLint parser doesn't bother to even check template literals for most rules it seems:
I agree, we need to make it parse from raw string instead of the interpreted string value.
Thank you both. 👍🏻
I push_front this task to my task queue. 📚
OK..., this might not be as easy as expected.
Does this mean we need oxc_string_literal::Parser to fix this strictly? 🤔
https://tc39.es/ecma262/2024/#prod-StringLiteral
Progress updates:
- [x] Implement StringLiteral parser
- [x] Basic implementation
- [x] Fix up with more tests
- [x] Handle
/literal/andnew RegExp("string")equally - [x] Update
Spanhandling
Closes by #6635 and #6741 .
@camchenry If you are planning to do something, please refer to this. 😉
https://github.com/oxc-project/oxc/blob/8032813bf85044222c0f8b343a648f7f3648845a/crates/oxc_linter/src/rules/eslint/no_control_regex.rs#L129-L137