oxc regular_expression: report character span offsets not byte offsets

see body of https://github.com/oxc-project/oxc/pull/6129

Sep 28 '24 12:09 camchenry

@leaysgur Maybe you have some insight on this. I was having trouble correctly parsing control characters because the span offsets didn't match the source when unicode escapes were involved. I'm not certain how to fix the issue though.

Sep 28 '24 22:09 camchenry

Yes, I think I know what is going on here. I will write down the details with a little verification when the weekend is over.

Sep 29 '24 01:09 leaysgur

TL;DR

This is an issue related to "escape sequences" in JavaScript string literals.

When using the RegExp constructor instead of /regex/ literals
And if an escape sequence is included in the string literal

The corresponding position might not be correctly retrieved from the parsed Span.

Simplified

If there is a "file" containing:

"O\"X\"C"

and you want to slice each O, X, and C, you need to calculate the slice position like this.

fn main() {
    let source_text = std::fs::read_to_string("./test.js").unwrap();
    assert_eq!("O", &source_text[1..2]);
                 // &source_text[2..3]; \ 👈🏻
                 // &source_text[3..4]; "
    assert_eq!("X", &source_text[4..5]);
                 // &source_text[5..6]; \ 👈🏻
                 // &source_text[6..7]; "
    assert_eq!("C", &source_text[7..8]);
}

You need to include \ as part of the slicing index.

However, if you're just handling the "string":

fn main() {
    let source_text = "O\"X\"C";
    assert_eq!("O", &source_text[0..1]);
                 // &source_text[1..2]; "
    assert_eq!("X", &source_text[2..3]);
                 // &source_text[3..4]; "
    assert_eq!("C", &source_text[4..5]);
}

There's no need to consider the \.

This difference is causing problems when a diagnostic tries to slice the file.

Current status

As an implementation detail, oxc_regular_expression is not aware of these escape sequences.

In most cases, the current code works with an AST with oxc_parser's StringLiteral and its value is passed to oxc_regular_expression.

At this point from oxc_regular_expression, /O"X"C/ and new RegExp("O\"X\"C") seems equivalent, and new RegExp('O"X"C') is also the same. (Both "\"'`" and '"\'`' are normalized as "\"'`" in StringLiteral.value.)

Thus, the regular expression can be parsed without needing to be aware of these differences.

However, when attempting to slice the original code string using the parsed Span, issues may arise.

As seen in the original PR:

new RegExp('\\u{1111}*\\x1F', 'u')
            ^         ^

These 2 \ cause a misalignment as a result.

At that time, I was aware of this issue but was only focused on parsing.

https://github.com/oxc-project/oxc/pull/3824#issuecomment-2291127478 https://github.com/oxc-project/oxc/pull/5106

I was satisfied with oxc_parser resolving the escapes.

How to fix

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

When parsing a regular expression from the RegExp constructor, use string_literal.span.source_text(&self.source_text) instead of string_literal.value
- This is the entire string, including quotes
- Then pass inner_text, quote_type, span_offset?
In oxc_regular_expression:
- Introduce a lexer layer in front of the current Reader implementation
- Tokenize the text based on the quote type, similar to oxc_parser/lexer
  - https://github.com/oxc-project/oxc/blob/21b08ba14119774ea3421802bee210d3f7a35694/crates/oxc_parser/src/lexer/string.rs#L155
- Should consider template literals surrounded by ` too?
  - https://github.com/oxc-project/oxc/blob/21b08ba14119774ea3421802bee210d3f7a35694/crates/oxc_parser/src/lexer/template.rs#L17
- Provide a mode toggle to use the lexer or not?
  - When parsing a regex literal, lexing is unnecessary
  - Also unnecessary when using as a standalone regex parser

Are there any other approaches to consider...?

Sep 30 '24 05:09 leaysgur

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

This makes sense to me, I think it would it also align us with the eslint-community/regexpp parser package. Looking at the examples there, it seems like the spans represent the source text, not just its actual value. They also include the raw string for reference, though I don't think we need to do that.

https://github.com/eslint-community/regexpp/blob/4d4787a2479b5c7f4984227c40123749898fb8a2/test/fixtures/parser/literal/basic-valid-2015-u.json#L6133-L6176

Should consider template literals surrounded by ` too?

I don't think we need to handle this necessarily. The ESLint parser doesn't bother to even check template literals for most rules it seems:

Sep 30 '24 17:09 camchenry

I agree, we need to make it parse from raw string instead of the interpreted string value.

Oct 01 '24 02:10 Boshen

Thank you both. 👍🏻

I push_front this task to my task queue. 📚

Oct 01 '24 03:10 leaysgur

OK..., this might not be as easy as expected.

Does this mean we need oxc_string_literal::Parser to fix this strictly? 🤔

https://tc39.es/ecma262/2024/#prod-StringLiteral

Oct 03 '24 07:10 leaysgur

Progress updates:

[x] Implement StringLiteral parser
- [x] Basic implementation
- [x] Fix up with more tests
[x] Handle /literal/ and new RegExp("string") equally
[x] Update Span handling

Oct 15 '24 09:10 leaysgur

Closes by #6635 and #6741 .

@camchenry If you are planning to do something, please refer to this. 😉

https://github.com/oxc-project/oxc/blob/8032813bf85044222c0f8b343a648f7f3648845a/crates/oxc_linter/src/rules/eslint/no_control_regex.rs#L129-L137

Oct 22 '24 05:10 leaysgur