lexgen
A fully-featured lexer generator, implemented as a proc macro
I need to match a token which contains unicode scalar values in the categories L, M, N, P, S and Cf. I see three different ways to solve this: 1....
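For reference, a rough std-only sketch of that character set. Rust's standard library has no general-category queries, so this approximates "L, M, N, P, S and Cf" as "not a separator and not a control"; an exact test needs Unicode tables (which is what the built-in regexes would supply), and the helper name here is made up:

```rust
// Rough approximation: categories L, M, N, P, S and Cf are (almost)
// everything that is not a separator (Z*) or a control (Cc). This also
// admits unassigned code points, so it is only a sketch, not an exact
// classifier.
fn is_token_char(c: char) -> bool {
    !c.is_whitespace() && !c.is_control()
}

fn main() {
    assert!(is_token_char('a'));        // Ll
    assert!(is_token_char('7'));        // Nd
    assert!(is_token_char('+'));        // Sm
    assert!(is_token_char('\u{200D}')); // Cf (zero-width joiner)
    assert!(!is_token_char(' '));       // Zs
    assert!(!is_token_char('\n'));      // Cc
}
```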
Recently I debugged a lexer with this rule:

```
("0b" | "0o" | "0x")? ($digit | '_')* $id? = ...,
```

This regex accepts the empty string, so the lexgen state...
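A minimal illustration of why a rule that accepts the empty string is a problem (a hand-rolled stand-in for the rule above, not lexgen's generated code):

```rust
// Simplified matcher for just the ($digit | '_')* part of the rule,
// which already accepts "". Returns the length of the longest match at
// the start of `input`.
fn match_len(input: &str) -> usize {
    input
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '_')
        .map(|c| c.len_utf8())
        .sum()
}

fn main() {
    // On input starting with a non-digit the rule still "succeeds",
    // with a zero-length lexeme...
    assert_eq!(match_len("abc"), 0);
    // ...so a naive driver loop like
    //   while !rest.is_empty() { let n = match_len(rest); rest = &rest[n..]; }
    // would never advance past position 0.
    assert_eq!(match_len("1_2abc"), 3);
}
```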
For the algorithm, in addition to the dragon book, there's a paper, "Fast brief practical DFA minimization", which is paywalled but available on sci-hub (doi:10.1016/j.ipl.2011.12.004). (edit: also available here https://www.cs.cmu.edu/~cdm/papers/Valmari12.pdf) A...
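For intuition, here is the textbook O(n²) "table filling" minimization (mark distinguishable state pairs, then count equivalence classes). This is *not* Valmari's partition-refinement algorithm from the paper, just the simplest correct baseline, on a made-up example DFA:

```rust
// `delta[s][a]` is the successor of state s on input symbol a.
// Returns the number of equivalence classes, i.e. states after merging.
fn minimized_state_count(n: usize, accepting: &[bool], delta: &[Vec<usize>]) -> usize {
    // dist[i][j]: states i and j are known to be distinguishable.
    let mut dist = vec![vec![false; n]; n];
    for i in 0..n {
        for j in 0..n {
            if accepting[i] != accepting[j] {
                dist[i][j] = true;
            }
        }
    }
    let alphabet = delta[0].len();
    loop {
        let mut changed = false;
        for i in 0..n {
            for j in 0..n {
                if dist[i][j] {
                    continue;
                }
                // i, j become distinguishable if some symbol leads to a
                // distinguishable pair.
                for a in 0..alphabet {
                    if dist[delta[i][a]][delta[j][a]] {
                        dist[i][j] = true;
                        changed = true;
                        break;
                    }
                }
            }
        }
        if !changed {
            break;
        }
    }
    // State i starts a new class iff it is distinguishable from every
    // earlier state.
    (0..n).filter(|&i| (0..i).all(|j| dist[i][j])).count()
}

fn main() {
    // DFA over {0, 1} accepting strings ending in '1'; states 1 and 2
    // are duplicates, so the minimal DFA has 2 states.
    let delta = vec![
        vec![0, 1], // state 0
        vec![0, 2], // state 1 (accepting)
        vec![0, 1], // state 2 (accepting)
    ];
    let accepting = vec![false, true, true];
    assert_eq!(minimized_state_count(3, &accepting, &delta), 2);
}
```

Valmari's algorithm gets the same answer in O(m log n) by refining partitions of both states and transitions, which matters once the generated DFAs get large.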
If we generate the DFA directly without going through an NFA:

- Should be more efficient
- Should generate a slightly better DFA
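One classic way to build a DFA straight from a regex, skipping the NFA, is Brzozowski derivatives: each DFA state is a regex, and the transition on character `c` is the derivative with respect to `c`. A minimal sketch (no state deduplication or memoization, which a real generator would need; the `Re` type and example are mine):

```rust
#[derive(Clone)]
enum Re {
    Empty,              // matches nothing
    Eps,                // matches ""
    Ch(char),
    Alt(Box<Re>, Box<Re>),
    Seq(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// Does r accept the empty string? (i.e. is the DFA state accepting?)
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Ch(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Seq(a, b) => nullable(a) && nullable(b),
    }
}

// deriv(r, c) matches exactly the suffixes s with c·s in L(r); in the
// DFA view it is the state reached from r on input c.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Ch(a) => if *a == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
        Re::Seq(a, b) => {
            let d = Re::Seq(Box::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(d), Box::new(deriv(b, c)))
            } else {
                d
            }
        }
        Re::Star(a) => Re::Seq(Box::new(deriv(a, c)), Box::new(r.clone())),
    }
}

// Running the "DFA": take one derivative per input char, then check
// whether the final state is accepting.
fn matches(mut r: Re, s: &str) -> bool {
    for c in s.chars() {
        r = deriv(&r, c);
    }
    nullable(&r)
}

fn main() {
    // a b*
    let re = Re::Seq(
        Box::new(Re::Ch('a')),
        Box::new(Re::Star(Box::new(Re::Ch('b')))),
    );
    assert!(matches(re.clone(), "a"));
    assert!(matches(re.clone(), "abb"));
    assert!(!matches(re.clone(), "ba"));
    assert!(!matches(re, ""));
}
```

Because derivatives of a character class collapse whole classes into one transition, the resulting DFA tends to be close to minimal, which is presumably where the "slightly better DFA" comes from.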
This would improve performance, as no UTF-8 decoding would be necessary. This is what RE2 does too.
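The property that makes byte-level matching sound: UTF-8 encodes larger scalar values as lexicographically larger byte strings, so a character-range test can run on raw bytes without decoding. A toy demonstration of the ordering property (helper names are mine; a real byte-level DFA, as in RE2, would compile the range into per-byte transitions instead):

```rust
fn utf8(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}

// Is the char encoded by `first_char_bytes` in the range lo..=hi?
// Plain byte-slice comparison works because UTF-8 preserves
// scalar-value order.
fn in_range_bytes(first_char_bytes: &[u8], lo: char, hi: char) -> bool {
    utf8(lo).as_slice() <= first_char_bytes && first_char_bytes <= utf8(hi).as_slice()
}

fn main() {
    // First char of "βγ", taken as raw bytes without decoding.
    let s = "βγ";
    let first = &s.as_bytes()[..'β'.len_utf8()];
    assert!(in_range_bytes(first, 'α', 'ω'));
    assert!(!in_range_bytes(&utf8('a'), 'α', 'ω'));
    // The ordering property itself:
    assert!(utf8('α') < utf8('ω'));
}
```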
Suppose I'm trying to lex this invalid Rust code: `b"\xa"`. The problem here is that `\x` needs to be followed by two hex digits, not one. If I run this with...
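A sketch of the check the lexer needs here, on an already-unquoted string body (a standalone function, not lexgen's generated code; a real rule would fail over to an error token at the reported offset):

```rust
// Validate \xNN escapes, returning the byte offset of the first bad
// escape: \x must be followed by exactly two hex digits.
fn check_hex_escapes(s: &str) -> Result<(), usize> {
    let b = s.as_bytes();
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'\\' && i + 1 < b.len() && b[i + 1] == b'x' {
            let ok = i + 3 < b.len()
                && b[i + 2].is_ascii_hexdigit()
                && b[i + 3].is_ascii_hexdigit();
            if !ok {
                return Err(i);
            }
            i += 4;
        } else {
            i += 1;
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(check_hex_escapes(r"\xa7"), Ok(()));
    assert_eq!(check_hex_escapes(r"\xa"), Err(0)); // only one hex digit
    assert_eq!(check_hex_escapes(r"ab\xZZ"), Err(2));
}
```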
Sometimes I want a lexer rule to be able to return multiple tokens, e.g. to emit a dummy token so the parser can use it as an end-marker for some syntax....
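One common way to get this is a pending-token queue in front of the lexer: a rule firing pushes any extra tokens into the queue, and the iterator drains the queue before pulling the next real token. A sketch with made-up `Token` and underlying-lexer stand-ins, not lexgen's API:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Clone)]
enum Token {
    Word(String),
    EndMarker,
}

struct MultiLexer {
    input: Vec<Token>,         // stands in for the underlying lexer
    pending: VecDeque<Token>,  // extra tokens queued by rule actions
}

impl Iterator for MultiLexer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        // Drain queued tokens first.
        if let Some(t) = self.pending.pop_front() {
            return Some(t);
        }
        if self.input.is_empty() {
            return None;
        }
        let t = self.input.remove(0);
        // Example rule action: after a Word, also emit a dummy
        // EndMarker the parser can key on.
        if let Token::Word(_) = t {
            self.pending.push_back(Token::EndMarker);
        }
        Some(t)
    }
}

fn main() {
    let lx = MultiLexer {
        input: vec![Token::Word("do".into())],
        pending: VecDeque::new(),
    };
    let toks: Vec<Token> = lx.collect();
    assert_eq!(toks, vec![Token::Word("do".into()), Token::EndMarker]);
}
```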
Currently we have these transitions in NFAs:

```rust
struct State {
    char_transitions: Map,
    range_transitions: RangeMap,
    empty_transitions: Set,
    any_transitions: Set,
    end_of_input_transitions: Set,
    ...
}
```

(I was confused for a few...
In the Lua lexer I see code like

```rust
'>' => {
    self.0.set_accepting_state(Lexer_ACTION_13); // 2
    match self.0.next() {
        None => {
            self.0.__done = true;
            match self.0.backtrack() { // 6
                ...
```
Some of the search tables for the built-in Unicode regular expressions are quite large, but I think 99.9999% of the time they will match ASCII characters, so we should implement a...
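A sketch of what such an ASCII fast path could look like: answer ASCII queries from a 128-bit bitmap and only binary-search the big range table for non-ASCII characters. The two-entry table here is a stand-in for the real (much larger) generated tables:

```rust
use std::cmp::Ordering;

// Sorted, non-overlapping scalar-value ranges for the non-ASCII part
// of the class (made-up sample data).
const NON_ASCII_RANGES: &[(u32, u32)] = &[(0x00C0, 0x024F), (0x0370, 0x03FF)];

fn in_class(c: char, ascii_bits: u128) -> bool {
    let cp = c as u32;
    if cp < 128 {
        // Fast path: one shift and mask for ASCII.
        (ascii_bits >> cp) & 1 == 1
    } else {
        // Slow path: binary search the range table.
        NON_ASCII_RANGES
            .binary_search_by(|&(lo, hi)| {
                if cp < lo {
                    Ordering::Greater
                } else if cp > hi {
                    Ordering::Less
                } else {
                    Ordering::Equal
                }
            })
            .is_ok()
    }
}

fn main() {
    // Bitmap marking the ASCII letters a-z and A-Z.
    let mut bits: u128 = 0;
    for c in ('a'..='z').chain('A'..='Z') {
        bits |= 1u128 << (c as u32);
    }
    assert!(in_class('x', bits));
    assert!(!in_class('1', bits));
    assert!(in_class('é', bits)); // U+00E9, found in the range table
    assert!(!in_class('中', bits));
}
```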