Linter plugins: Token-related `SourceCode` APIs

Open overlookmotel opened this issue 2 months ago • 7 comments

As noted in JS Plugins Milestone 2 (#14826), our first priority is to implement all APIs related to "fix code"-type linter plugins.

Later on, we'll want to also support stylistic/formatting rules, which rely on Token-related APIs in SourceCode. These methods are currently stubbed out in tokens.ts.

JS implementation

Unlike many other APIs, I am unclear if it's feasible to implement the Token-related APIs by pulling in a JS dependency.

@typescript-eslint/parser does provide an array of Tokens from its parser, but it uses TypeScript's parser under the hood. Parsing every file again on the JS side using TypeScript would be really slow.

espree (ESLint's parser) has a tokenize() method. That doesn't parse, only tokenizes, so its performance might be acceptable (according to Claude, anyway). However, will it work for TS? Espree can't parse TS, but maybe that doesn't matter when you're only asking it to tokenize? (Claude says yes, but I don't entirely believe him!)

If Espree can work for us, I think this should be our first port of call, in interests of getting a working implementation out quickly.

Laziness

Like we do with comments, we'd want to lazily call tokenize(), only when one of the token-related APIs is first used, so that this work is skipped when no JS rule requires tokens.
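
A minimal sketch of that lazy pattern, assuming espree's tokenize() as the tokenizer and a simplified SourceCode/Token shape purely for illustration:

import { tokenize } from "espree";

interface Token {
  type: string;
  value: string;
  range: [number, number];
}

class SourceCode {
  #text: string;
  #tokens: Token[] | null = null;

  constructor(text: string) {
    this.#text = text;
  }

  // Tokenize on first use only; later calls reuse the cached array.
  #allTokens(): Token[] {
    if (this.#tokens === null) {
      this.#tokens = tokenize(this.#text, { range: true, ecmaVersion: "latest" }) as Token[];
    }
    return this.#tokens;
  }

  getFirstToken(): Token | undefined {
    return this.#allTokens()[0];
  }
}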

Rust implementation

This will be what we'll want to do in the end, as it'll be faster. However, it's challenging.

Altering the lexer

Oxc's lexer generates a stream of Tokens, however, they're not stored anywhere. We'd need to adapt it to also store all the Tokens in an oxc_allocator::Vec in the arena. That Vec can then be shared with JS through the same "raw transfer" mechanism as is used for the AST.

One difficulty is that the lexer sometimes has to rewind, because some syntax is ambiguous when reading forwards through the file. e.g. is (a, b) a SequenceExpression, or the start of an arrow function (a, b) => ...? When rewinding occurs, we'd need to also rewind the stream of Tokens, so tokens don't get included twice in these cases.

Avoiding perf hit

Oxc's parser is not only used in the linter - it's the first step of the parse-transform-minify-print and parse-format pipelines. Its performance in these pipelines is critical, and hurting perf while implementing what we need for linter would be unacceptable.

We'll need some kind of conditional compilation, so linter uses a different version of the parser from other pipelines.

Cargo features are problematic, due to feature unification in tests, so I would propose an approach using generics:

// ----- Traits -----

trait ParserConfig {
    type Tokens<'a>: TokenStore<'a>;
}

trait TokenStore<'a> {
    type Checkpoint;

    fn new(allocator: &'a Allocator) -> Self;
    fn push(&mut self, token: Token);
    fn checkpoint(&self) -> Self::Checkpoint;
    fn rewind(&mut self, checkpoint: Self::Checkpoint);
}

// ----- parse-transform-minify-print pipeline -----

struct StandardParserConfig;

impl ParserConfig for StandardParserConfig {
    type Tokens<'a> = NoopTokenStore;
}

struct NoopTokenStore;

impl<'a> TokenStore<'a> for NoopTokenStore {
    type Checkpoint = ();

    fn new(_allocator: &'a Allocator) -> Self { Self }
    fn push(&mut self, _token: Token) {}
    fn checkpoint(&self) {}
    fn rewind(&mut self, _checkpoint: ()) {}
}

// ----- Oxlint -----

struct OxlintParserConfig;

impl ParserConfig for OxlintParserConfig {
    type Tokens<'a> = RealTokenStore<'a>;
}

struct RealTokenStore<'a> {
    tokens: Vec<'a, Token>,
}

impl<'a> TokenStore<'a> for RealTokenStore<'a> {
    type Checkpoint = u32;

    fn new(allocator: &'a Allocator) -> Self {
        Self { tokens: Vec::new_in(allocator) }
    }
    fn push(&mut self, token: Token) {
        self.tokens.push(token);
    }
    fn checkpoint(&self) -> u32 {
        self.tokens.len() as u32
    }
    fn rewind(&mut self, checkpoint: u32) {
        self.tokens.truncate(checkpoint as usize);
    }
}

// Parser becomes generic over `ParserConfig`

pub struct Parser<'a, Config: ParserConfig> {
    // ...
    marker: PhantomData<Config>,
}

struct ParserImpl<'a, Config: ParserConfig> {
    // ...
    tokens: Config::Tokens<'a>,
}

impl<'a, Config: ParserConfig> ParserImpl<'a, Config> {
    fn somewhere_in_parser_or_lexer(&mut self, token: Token) {
        // This is a complete no-op when `Config` is `StandardParserConfig`
        self.tokens.push(token);
    }
}

Token type

Important note: We do not want to increase the size of the Rust Token type unless absolutely unavoidable. Keeping Token as a single u128 is critical for performance.

Why the additional ParserConfig abstraction?

The above code example would work without ParserConfig. Parser and ParserImpl could both be generic over TokenStore, and we could "cut out the middleman" of ParserConfig - it adds no further value.

The reason for ParserConfig is that I think it could be useful for other purposes further down the line. e.g. at some point fairly soon we'll want to move construction of the UTF-8 to UTF-16 translation table into the lexer. We'd want that to be generic too, so its cost is only paid by the linter. We can add that to ParserConfig to avoid needing 2 generic params everywhere (Parser<T: TokenStore, U: UtfTranslationTable>).

It'd be nice if we can make that change later on without loads of code churn - introducing ParserConfig now enables that.

Translation to ESLint Token format

Similar to how we had to build a translation mechanism from Oxc's Rust AST to ESTree standard, we'll likely need to translate from Oxc's Token format to ESLint's format. What's needed here requires research.

The big question is: our current Token type doesn't include the content of the tokens, only their start and end position, the token's Kind (type), and some flags. Can we avoid altering the Token type, and get the token content (e.g. the content of a string literal token) on the JS side by slicing the source text using start and end?
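
For illustration, a sketch of what that slicing could look like on the JS side, assuming a raw token shape of kind/start/end (placeholder names, not the real Token layout):

interface RawToken {
  kind: string; // token Kind, translated to an ESLint-style type name
  start: number;
  end: number;
}

// If slicing works, the ESLint value would simply be the source text
// between the token's start and end offsets.
function tokenValue(sourceText: string, token: RawToken): string {
  return sourceText.slice(token.start, token.end);
}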

overlookmotel avatar Oct 20 '25 19:10 overlookmotel

This will not be planned for Milestone 2.

Let's survey the plugins that require tokens before starting.

espree (ESLint's parser) has a tokenize() method. That doesn't parse, only tokenizes

This is not going to work because JS syntax requires parsing context to produce the correct token in ambiguous places such as /.

Boshen avatar Oct 21 '25 01:10 Boshen

This would be good to have for Milestone 2 (or soon after) as it's a blocker for eslint-plugin-react-hooks

matthewdavi avatar Oct 23 '25 13:10 matthewdavi

I looked at popular ESLint plugins that haven't been ported to Rust. Though there were some methods that I did not see used even once, 9 of the 10 plugins I looked at used some token-based method.

eslint-plugin-perfectionist

  • Downloads: 1.2M a week
  • Rules: sort-imports, sort-switch-case
  • APIs: getTokensBetween, getTokenAfter (code search)

eslint-plugin-sonarjs

  • Downloads: 1.5M a week
  • Rules: 62
  • APIs: getTokens, getToken(s)Before, getToken(s)Between, getToken(s)After, getFirstToken, getLastToken (code search)

eslint-plugin-vue

  • Downloads: 4.3M a week
  • Rules: 33
  • APIs: getTokensBetween, getTokenBefore, getTokenAfter, getFirstToken, getLastToken, getTokens, getFirstTokensBetween (code search)

eslint-plugin-svelte

  • Downloads: 584k a week
  • Rules: 17
  • APIs: getFirstToken(s), getLastToken, getTokenAfter, getTokens, getTokenBefore, getTokensBetween (code search)

eslint-plugin-node

  • Downloads: 3.5M a week
  • Rules: exports-style, no-unsupported-features
  • APIs: getTokenAfter, getLastToken (code search)

eslint-plugin-security

  • Downloads: 1.3M a week
  • Rules: detect-unsafe-regex, detect-no-csrf-before-method-override
  • APIs: getTokens (code search)

eslint-plugin-ember

  • Downloads: 219k a week
  • Rules: 6
  • APIs: getTokens, getTokenAfter, getFirstToken, getLastToken (code search)

eslint-plugin-testing-library

  • Downloads: 5.6M a week
  • Rules: no-await-sync-queries, no-wait-for-side-effects
  • APIs: getFirstToken, getTokenAfter (code search)


I looked at espree briefly. It keeps the tokens ambiguous as "punctuators" when given ambiguous syntax, and that's what plugins have come to expect from the tokens API. However, unlike what Claude said, it throws on TypeScript syntax. I don't think it's viable.

lilnasy avatar Oct 24 '25 17:10 lilnasy

Assigned to myself so I can schedule a time to think about this problem.

Boshen avatar Oct 25 '25 12:10 Boshen

It looks like we can get token values by slicing the source text, except in 1 weird case (escaped identifiers).

This source:

123;
/_\n_/u;
"_\n_";
`_\n_${x}_\n_`;
_\u{24}_;

Produces tokens:

[
  { type: 'Numeric', value: '123', range: [ 0, 3 ] },
  { type: 'Punctuator', value: ';', range: [ 3, 4 ] },
  { type: 'RegularExpression', value: '/_\\n_/u', range: [ 5, 12 ], regex: { flags: 'u', pattern: '_\\n_' } },
  { type: 'Punctuator', value: ';', range: [ 12, 13 ] },
  { type: 'String', value: '"_\\n_"', range: [ 14, 20 ] },
  { type: 'Punctuator', value: ';', range: [ 20, 21 ] },
  { type: 'Template', value: '`_\\n_${', range: [ 22, 29 ] },
  { type: 'Identifier', value: 'x', range: [ 29, 30 ] },
  { type: 'Template', value: '}_\\n_`', range: [ 30, 36 ] },
  { type: 'Punctuator', value: ';', range: [ 36, 37 ] },
  { type: 'Identifier', value: '_$_', range: [ 38, 46 ] },
  { type: 'Punctuator', value: ';', range: [ 46, 47 ] }
]

value is always a source slice, except for escaped Identifiers, and the only other oddity is the regex property on RegularExpression tokens.

Regexps

Slice source text, find last /, split into flags and pattern.
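
As a sketch (the function name is hypothetical), that split could look like:

// Split a regex token's source text (e.g. "/_\n_/u") into pattern and flags.
// Everything between the first slash and the last slash is the pattern;
// everything after the last slash is the flags.
function splitRegExpToken(tokenText: string): { pattern: string; flags: string } {
  const lastSlash = tokenText.lastIndexOf("/");
  return {
    pattern: tokenText.slice(1, lastSlash),
    flags: tokenText.slice(lastSlash + 1),
  };
}

// splitRegExpToken("/_\\n_/u") -> { pattern: "_\\n_", flags: "u" }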

Escaped identifiers

Store unescaped identifier strings in a side table, and store the index into the side table in spare bits of Token.

Escaped identifiers are extremely rare. The maximum number of them theoretically possible is if source text was u32::MAX bytes long (the max we support), consisting entirely of \u{24},\u{24},\u{24},.... 7 bytes per repetition = 613,566,756 identifiers. So max escaped identifier count requires 30 bits to store.

Token currently has 24 bits spare. We could either:

  1. Compact 2 of the boolean flags to free up another 8 bits. or
  2. Use only 24 bits, and crash if the source contains more than 1 << 24 (16 million) escaped identifiers. Surely that's never going to happen in practice!

Change how escaped identifiers are stored in lexer?

In the lexer, we currently store unescaped strings and identifiers in HashMap<u32, &str>s keyed by span.start, and set a has_escape flag in Token. When has_escape is true, the parser pulls the string back out of the HashMap.

We could change that to:

  • Store unescaped strings and identifiers in Vecs instead of HashMaps.
  • Store index into relevant Vec as a u32 in Token (replacing the has_escape flag would leave 32 bits free).
  • Use index 0 as a sentinel value meaning "not escaped" (put 1 dummy entry into each Vec at the start so 0 is not a real index).

Replacing the HashMaps with Vecs will likely be more performant, so this should be a gain, independent of Oxlint plugins. But conveniently, the same Vec would also act as our unescaped identifier "side table" in plugins.
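
On the JS side, resolving an identifier token's value could then look something like this (field and parameter names are placeholders, not the real raw-transfer layout):

// Combine the source-slice approach with the escaped-identifier side table.
function identifierValue(
  sourceText: string,
  token: { start: number; end: number; escapeIndex: number },
  unescapedIdentifiers: string[],
): string {
  // Index 0 is the sentinel meaning "not escaped" (slot 0 holds a dummy entry).
  if (token.escapeIndex !== 0) {
    return unescapedIdentifiers[token.escapeIndex];
  }
  return sourceText.slice(token.start, token.end);
}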

Conclusion

So it looks like we can store all the data we need in Token, without increasing its size.

overlookmotel avatar Nov 08 '25 18:11 overlookmotel

We need to research whether these tokens match between our parser and what ESLint uses before making any changes to our Rust code.

Boshen avatar Nov 10 '25 16:11 Boshen

From @lilnasy's research it's clear that the lack of token APIs is going to be a blocker for a lot of ESLint plugins working, not just formatting plugins.

But getting our Rust parser to generate tokens, and translating them from Oxc's Rust format to ESLint's format, is going to take some time. There will likely be complications.

We want to get to the point where the majority of ESLint rules run correctly in Oxlint as quickly as we can.

After some consideration, here's what Boshen and I think we should do:

Phase 1: Slow but working

  • Pull in @typescript-eslint/parser as a dependency (or bundle it).
  • Use it to parse and generate an array of Tokens.
  • Trigger TS-ESLint's parsing lazily, only when tokens property on Program is accessed, or when one of the token APIs is used.

We may also want to investigate whether espree is faster. If so, we could use Espree for JS files, and TS-ESLint for TS files. Note: I'm referring to Espree's parse() method, NOT tokenize() - tokenize won't work, for the reasons Boshen gave above.
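
A rough sketch of the lazy Phase 1 approach, assuming @typescript-eslint/typescript-estree's parse() with the tokens option (the exact entry point, option names, and per-file caching are still to be decided):

import { parse } from "@typescript-eslint/typescript-estree";
import type { TSESTree } from "@typescript-eslint/typescript-estree";

// In a real implementation the cache would be per-file; a single variable is
// used here only to keep the sketch short.
let cachedTokens: TSESTree.Token[] | null = null;

function getTokens(sourceText: string): TSESTree.Token[] {
  if (cachedTokens === null) {
    // Only pay for the second (JS-side) parse when a rule actually asks for tokens.
    const program = parse(sourceText, { tokens: true, range: true });
    cachedTokens = program.tokens ?? [];
  }
  return cachedTokens;
}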

Phase 2: Rustify

Re-implement generating tokens on Rust side, using something like the scheme I've laid out above.

As Boshen said above, before we embark on that, we should research to make sure we're going to get the result we need.

overlookmotel avatar Nov 10 '25 16:11 overlookmotel

It seems the TS-ESLint parser diverges from ESLint in tokens for escaped identifiers:

Source:

\u{41}

Espree (ESLint's parser):

{
  "type": "Identifier",
  "value": "A",
  "range": [0, 6]
}

AST explorer

TS-ESLint parser:

{
  "type": "Identifier",
  "value": "\\u{41}",
  "range": [0, 6]
}

AST explorer

TS-ESLint have decided this is a case of "won't fix", so if we decide to align with TS-ESLint, we don't have to worry about escaped identifiers either.

overlookmotel avatar Nov 19 '25 16:11 overlookmotel

The "slow but working" implementation is on its tail end of development.

The APIs are built; next we have to build confidence in them. That means trying them out with popular plugins, adding tests, and making small refactors and optimizations - all necessary for this to become a good foundation for the Rust implementation.

  • [x] getTokens #15861
  • [x] getFirstToken #16002
  • [x] getFirstTokens #15976
  • [x] getLastToken #16003
  • [x] getLastTokens #16000
  • [x] getTokenBefore #15956
  • [x] getTokenOrCommentBefore #16044
  • [x] getTokensBefore #15955
  • [x] getTokenAfter #15973
  • [x] getTokenOrCommentAfter #16045
  • [x] getTokensAfter #15971
  • [x] getTokensBetween #16034
  • [x] getFirstTokenBetween #16032
  • [x] getFirstTokensBetween #16019
  • [x] getLastTokenBetween #16008
  • [x] getLastTokensBetween #16033
  • [x] getTokenByRangeStart #16043
  • [x] isSpaceBetween reimplementation with token APIs (JSX is surprisingly tricky!) #16055

A few things that came up during development:

  • For the performance-code size tradeoff, it would be nice to have a way to inline functions. https://github.com/oxc-project/oxc/pull/15955#discussion_r2553215901
  • We need to marshal from the Rust side whether the current file is JSX. This is important for testing isSpaceBetween as well. https://github.com/oxc-project/oxc/pull/15861#discussion_r2542850640. The codegen is kinda messy, it'll be fun! 🌈

lilnasy avatar Nov 24 '25 11:11 lilnasy

All token APIs are implemented! 🥳

Have opened a separate issue #16207 for Rustification...

overlookmotel avatar Nov 27 '25 13:11 overlookmotel