
Convert simple byte handlers in lexer to branchless code

Open overlookmotel opened this issue 9 months ago • 8 comments

Why the lexer is slower than it needs to be

Many of the simpler byte handlers in the lexer perform multiple branches.

e.g. the handler for `=`, which needs to determine whether the whole token is `=`, `==`, `===`, or `=>`:

https://github.com/oxc-project/oxc/blob/b9d69ad665ab386233f940e9f1a9b3095a2238dc/crates/oxc_parser/src/lexer/byte_handlers.rs#L389-L403

Typical JS code will contain a random mix of assignments (`x = 1`), equality checks (`x === 1`), and arrow functions (`=>`).

Because of this random distribution, the branch predictor won't be able to predict what the token is going to be, and so which way these branches are going to go. It will likely get it wrong at least 50% of the time. Branch mis-prediction is fairly expensive (I think I read about 20 CPU cycles each time), and these functions are on a hot path, because `=`, `===`, and `=>` are all common in JS code.
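For illustration, the branchy logic in the linked handler is roughly equivalent to the following self-contained sketch (not the actual lexer code; `rest` and `eq_token` are hypothetical names standing in for the bytes after the initial `=` and the handler itself):

```rust
// Hypothetical stand-in for the branchy `=` handler. Every comparison
// below is a conditional branch the predictor must guess, and the
// outcome depends on unpredictable source text.
fn eq_token(rest: &[u8]) -> &'static str {
    if rest.first() == Some(&b'=') {
        if rest.get(1) == Some(&b'=') {
            "===" // strict equality
        } else {
            "==" // loose equality
        }
    } else if rest.first() == Some(&b'>') {
        "=>" // arrow function
    } else {
        "=" // plain assignment
    }
}
```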

Make it branchless!

We could convert these matchers to branchless code with something like this:

// Set the discriminants for `Kind` to always follow the pattern
// that `discriminant % 4` is number of bytes this token takes
const EQ_BASE: u8 = 64; // or whatever multiple of 4

#[derive(PartialEq, Clone, Copy)]
#[repr(u8)]
pub enum Kind {
    /// `=`
    Eq = EQ_BASE + 1,
    /// `==`
    Eq2 = EQ_BASE + 2,
    /// `===`
    Eq3 = EQ_BASE + 3,
    /// `=>`
    Arrow = EQ_BASE + 4 + 2,

    // ...all the other token kinds
}

ascii_byte_handler!(EQL(lexer) { 
    const fn num_bytes(kind: Kind) -> usize { kind as usize % 4 }

    // Check the invariant about discriminants mentioned above at compile time
    const_assert!(num_bytes(Kind::Eq) == "=".len());
    const_assert!(num_bytes(Kind::Eq2) == "==".len());
    const_assert!(num_bytes(Kind::Eq3) == "===".len());
    const_assert!(num_bytes(Kind::Arrow) == "=>".len());

    // NB: For simplicity, not handling case where close to EOF
    // and there are not 2 more bytes remaining in source to read.
    // Would need to handle that case separately, in a `#[cold]` branch.

    let next2: [u8; 2] = unsafe { lexer.source.position().add(1).read2() };
    let kind = if next2 == [b'=', b'='] {
        Kind::Eq3
    } else {
        match next2[0] {
            b'=' => Kind::Eq2,
            b'>' => Kind::Arrow,
            _ => Kind::Eq,
        }
    };

    unsafe { lexer.source.advance(num_bytes(kind)); }

    kind
});

The trick is to make Kind play two roles. It's the token kind, but it also encodes how many bytes the token is. This is what allows the compiler to use cmovne instructions rather than branches.

Godbolt shows this compiles down to branchless code: https://godbolt.org/z/5W7qjfqMq
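To make the trick concrete, here is a safe, self-contained version of the same idea, with the unsafe source-pointer reads and EOF handling omitted (`lex_eq` and `next2` are hypothetical names; the real handler operates on the lexer's source cursor):

```rust
const EQ_BASE: u8 = 64; // any multiple of 4 works

#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(u8)]
enum Kind {
    Eq = EQ_BASE + 1,        // `=`   -> 65 % 4 == 1 byte
    Eq2 = EQ_BASE + 2,       // `==`  -> 66 % 4 == 2 bytes
    Eq3 = EQ_BASE + 3,       // `===` -> 67 % 4 == 3 bytes
    Arrow = EQ_BASE + 4 + 2, // `=>`  -> 70 % 4 == 2 bytes
}

const fn num_bytes(kind: Kind) -> usize {
    kind as usize % 4
}

// `next2` is the 2 bytes after the initial `=`. The token kind and the
// number of bytes to advance are computed together — the advance falls
// out of the discriminant, so no further branching on the token type
// is needed.
fn lex_eq(next2: [u8; 2]) -> (Kind, usize) {
    let kind = if next2 == [b'=', b'='] {
        Kind::Eq3
    } else {
        match next2[0] {
            b'=' => Kind::Eq2,
            b'>' => Kind::Arrow,
            _ => Kind::Eq,
        }
    };
    (kind, num_bytes(kind))
}
```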

Note on benchmarking

We need to alter the benchmarking setup before we do this. CodSpeed have advised that, due to the way their benchmarking works, it doesn't take branch mis-prediction into account.

So, even if this change makes a solid difference in the real world (which I think it could), it won't show up on our CodSpeed benchmarks.

We need some way to check that this is improving performance in the way that we think it is.
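One possible approach (an assumption on my part, not something the issue prescribes) is to measure branch misses directly with Linux `perf stat` on a real parser run; the benchmark binary name and input file here are hypothetical placeholders:

```shell
# Build the benchmark binary in release mode, then count branches and
# branch misses while parsing a large fixture. Compare the
# `branch-misses` rate before and after the branchless change.
cargo build --release
perf stat -e branches,branch-misses ./target/release/parser_bench fixture.js
```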

— overlookmotel, May 15 '24 13:05