lexgen
A fully-featured lexer generator, implemented as a proc macro
I need to match a token which contains unicode scalar values in the categories L, M, N, P, S and Cf. I see three different ways to solve this: 1....
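For reference, a rough std-only sketch of that character set. Rust's standard library has no general-category queries, so this approximates "L, M, N, P, S and Cf" as "not a separator and not a control"; an exact test needs Unicode tables (which is what the built-in regexes would supply), and the helper name here is made up:

```rust
// Rough approximation: categories L, M, N, P, S and Cf are (almost)
// everything that is not a separator (Z*) or a control (Cc). This also
// admits unassigned code points, so it is only a sketch, not an exact
// classifier.
fn is_token_char(c: char) -> bool {
    !c.is_whitespace() && !c.is_control()
}

fn main() {
    assert!(is_token_char('a'));        // Ll
    assert!(is_token_char('7'));        // Nd
    assert!(is_token_char('+'));        // Sm
    assert!(is_token_char('\u{200D}')); // Cf (zero-width joiner)
    assert!(!is_token_char(' '));       // Zs
    assert!(!is_token_char('\n'));      // Cc
}
```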
Recently I debugged a lexer with this rule:

```
("0b" | "0o" | "0x")? ($digit | '_')* $id? = ...,
```

This regex accepts the empty string, so the lexgen state...
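A minimal illustration of why a rule that accepts the empty string is a problem (a hand-rolled stand-in for the rule above, not lexgen's generated code):

```rust
// Simplified matcher for just the ($digit | '_')* part of the rule,
// which already accepts "". Returns the length of the longest match at
// the start of `input`.
fn match_len(input: &str) -> usize {
    input
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '_')
        .map(|c| c.len_utf8())
        .sum()
}

fn main() {
    // On input starting with a non-digit the rule still "succeeds",
    // with a zero-length lexeme...
    assert_eq!(match_len("abc"), 0);
    // ...so a naive driver loop like
    //   while !rest.is_empty() { let n = match_len(rest); rest = &rest[n..]; }
    // would never advance past position 0.
    assert_eq!(match_len("1_2abc"), 3);
}
```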
For the algorithm, in addition to the dragon book, there's a paper, "Fast brief practical DFA minimization", which is paywalled but available on sci-hub (doi:10.1016/j.ipl.2011.12.004). (edit: also available here https://www.cs.cmu.edu/~cdm/papers/Valmari12.pdf) A...
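For intuition, here is the textbook O(n²) "table filling" minimization (mark distinguishable state pairs, then count equivalence classes). This is *not* Valmari's partition-refinement algorithm from the paper, just the simplest correct baseline, on a made-up example DFA:

```rust
// `delta[s][a]` is the successor of state s on input symbol a.
// Returns the number of equivalence classes, i.e. states after merging.
fn minimized_state_count(n: usize, accepting: &[bool], delta: &[Vec<usize>]) -> usize {
    // dist[i][j]: states i and j are known to be distinguishable.
    let mut dist = vec![vec![false; n]; n];
    for i in 0..n {
        for j in 0..n {
            if accepting[i] != accepting[j] {
                dist[i][j] = true;
            }
        }
    }
    let alphabet = delta[0].len();
    loop {
        let mut changed = false;
        for i in 0..n {
            for j in 0..n {
                if dist[i][j] {
                    continue;
                }
                // i, j become distinguishable if some symbol leads to a
                // distinguishable pair.
                for a in 0..alphabet {
                    if dist[delta[i][a]][delta[j][a]] {
                        dist[i][j] = true;
                        changed = true;
                        break;
                    }
                }
            }
        }
        if !changed {
            break;
        }
    }
    // State i starts a new class iff it is distinguishable from every
    // earlier state.
    (0..n).filter(|&i| (0..i).all(|j| dist[i][j])).count()
}

fn main() {
    // DFA over {0, 1} accepting strings ending in '1'; states 1 and 2
    // are duplicates, so the minimal DFA has 2 states.
    let delta = vec![
        vec![0, 1], // state 0
        vec![0, 2], // state 1 (accepting)
        vec![0, 1], // state 2 (accepting)
    ];
    let accepting = vec![false, true, true];
    assert_eq!(minimized_state_count(3, &accepting, &delta), 2);
}
```

Valmari's algorithm gets the same answer in O(m log n) by refining partitions of both states and transitions, which matters once the generated DFAs get large.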
If we generate the DFA directly without going through an NFA:

- Should be more efficient
- Should generate a slightly better DFA
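One classic way to build a DFA straight from a regex, skipping the NFA, is Brzozowski derivatives: each DFA state is a regex, and the transition on character `c` is the derivative with respect to `c`. A minimal sketch (no state deduplication or memoization, which a real generator would need; the `Re` type and example are mine):

```rust
#[derive(Clone)]
enum Re {
    Empty,              // matches nothing
    Eps,                // matches ""
    Ch(char),
    Alt(Box<Re>, Box<Re>),
    Seq(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

// Does r accept the empty string? (i.e. is the DFA state accepting?)
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Ch(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Seq(a, b) => nullable(a) && nullable(b),
    }
}

// deriv(r, c) matches exactly the suffixes s with c·s in L(r); in the
// DFA view it is the state reached from r on input c.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Ch(a) => if *a == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
        Re::Seq(a, b) => {
            let d = Re::Seq(Box::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(d), Box::new(deriv(b, c)))
            } else {
                d
            }
        }
        Re::Star(a) => Re::Seq(Box::new(deriv(a, c)), Box::new(r.clone())),
    }
}

// Running the "DFA": take one derivative per input char, then check
// whether the final state is accepting.
fn matches(mut r: Re, s: &str) -> bool {
    for c in s.chars() {
        r = deriv(&r, c);
    }
    nullable(&r)
}

fn main() {
    // a b*
    let re = Re::Seq(
        Box::new(Re::Ch('a')),
        Box::new(Re::Star(Box::new(Re::Ch('b')))),
    );
    assert!(matches(re.clone(), "a"));
    assert!(matches(re.clone(), "abb"));
    assert!(!matches(re.clone(), "ba"));
    assert!(!matches(re, ""));
}
```

Because derivatives of a character class collapse whole classes into one transition, the resulting DFA tends to be close to minimal, which is presumably where the "slightly better DFA" comes from.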
This would improve performance, as no UTF-8 decoding would be necessary. This is what RE2 does too.
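The property that makes byte-level matching sound: UTF-8 encodes larger scalar values as lexicographically larger byte strings, so a character-range test can run on raw bytes without decoding. A toy demonstration of the ordering property (helper names are mine; a real byte-level DFA, as in RE2, would compile the range into per-byte transitions instead):

```rust
fn utf8(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}

// Is the char encoded by `first_char_bytes` in the range lo..=hi?
// Plain byte-slice comparison works because UTF-8 preserves
// scalar-value order.
fn in_range_bytes(first_char_bytes: &[u8], lo: char, hi: char) -> bool {
    utf8(lo).as_slice() <= first_char_bytes && first_char_bytes <= utf8(hi).as_slice()
}

fn main() {
    // First char of "βγ", taken as raw bytes without decoding.
    let s = "βγ";
    let first = &s.as_bytes()[..'β'.len_utf8()];
    assert!(in_range_bytes(first, 'α', 'ω'));
    assert!(!in_range_bytes(&utf8('a'), 'α', 'ω'));
    // The ordering property itself:
    assert!(utf8('α') < utf8('ω'));
}
```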
Suppose I'm trying to lex this invalid Rust code: `b"\xa"`. The problem here is that `\x` needs to be followed by two hex digits, not one. If I run this with...
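A sketch of the check the lexer needs here, on an already-unquoted string body (a standalone function, not lexgen's generated code; a real rule would fail over to an error token at the reported offset):

```rust
// Validate \xNN escapes, returning the byte offset of the first bad
// escape: \x must be followed by exactly two hex digits.
fn check_hex_escapes(s: &str) -> Result<(), usize> {
    let b = s.as_bytes();
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'\\' && i + 1 < b.len() && b[i + 1] == b'x' {
            let ok = i + 3 < b.len()
                && b[i + 2].is_ascii_hexdigit()
                && b[i + 3].is_ascii_hexdigit();
            if !ok {
                return Err(i);
            }
            i += 4;
        } else {
            i += 1;
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(check_hex_escapes(r"\xa7"), Ok(()));
    assert_eq!(check_hex_escapes(r"\xa"), Err(0)); // only one hex digit
    assert_eq!(check_hex_escapes(r"ab\xZZ"), Err(2));
}
```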
Sometimes I want a lexer rule to be able to return multiple tokens, e.g. to emit a dummy token so the parser can use it as an end-marker for some syntax....
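One common way to get this is a pending-token queue in front of the lexer: a rule firing pushes any extra tokens into the queue, and the iterator drains the queue before pulling the next real token. A sketch with made-up `Token` and underlying-lexer stand-ins, not lexgen's API:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Clone)]
enum Token {
    Word(String),
    EndMarker,
}

struct MultiLexer {
    input: Vec<Token>,         // stands in for the underlying lexer
    pending: VecDeque<Token>,  // extra tokens queued by rule actions
}

impl Iterator for MultiLexer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        // Drain queued tokens first.
        if let Some(t) = self.pending.pop_front() {
            return Some(t);
        }
        if self.input.is_empty() {
            return None;
        }
        let t = self.input.remove(0);
        // Example rule action: after a Word, also emit a dummy
        // EndMarker the parser can key on.
        if let Token::Word(_) = t {
            self.pending.push_back(Token::EndMarker);
        }
        Some(t)
    }
}

fn main() {
    let lx = MultiLexer {
        input: vec![Token::Word("do".into())],
        pending: VecDeque::new(),
    };
    let toks: Vec<Token> = lx.collect();
    assert_eq!(toks, vec![Token::Word("do".into()), Token::EndMarker]);
}
```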
Currently we have these transitions in NFAs:

```rust
struct State {
    char_transitions: Map,
    range_transitions: RangeMap,
    empty_transitions: Set,
    any_transitions: Set,
    end_of_input_transitions: Set,
    ...
}
```

(I was confused for a few...
In the Lua lexer I see code like

```rust
'>' => {
    self.0.set_accepting_state(Lexer_ACTION_13); // 2
    match self.0.next() {
        None => {
            self.0.__done = true;
            match self.0.backtrack() { // 6
                ...
```
Some of the search tables for the built-in Unicode regular expressions are quite large, but I think 99.9999% of the time they will match ASCII characters, so we should implement a...
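A sketch of what such an ASCII fast path could look like: answer ASCII queries from a 128-bit bitmap and only binary-search the big range table for non-ASCII characters. The two-entry table here is a stand-in for the real (much larger) generated tables:

```rust
use std::cmp::Ordering;

// Sorted, non-overlapping scalar-value ranges for the non-ASCII part
// of the class (made-up sample data).
const NON_ASCII_RANGES: &[(u32, u32)] = &[(0x00C0, 0x024F), (0x0370, 0x03FF)];

fn in_class(c: char, ascii_bits: u128) -> bool {
    let cp = c as u32;
    if cp < 128 {
        // Fast path: one shift and mask for ASCII.
        (ascii_bits >> cp) & 1 == 1
    } else {
        // Slow path: binary search the range table.
        NON_ASCII_RANGES
            .binary_search_by(|&(lo, hi)| {
                if cp < lo {
                    Ordering::Greater
                } else if cp > hi {
                    Ordering::Less
                } else {
                    Ordering::Equal
                }
            })
            .is_ok()
    }
}

fn main() {
    // Bitmap marking the ASCII letters a-z and A-Z.
    let mut bits: u128 = 0;
    for c in ('a'..='z').chain('A'..='Z') {
        bits |= 1u128 << (c as u32);
    }
    assert!(in_class('x', bits));
    assert!(!in_class('1', bits));
    assert!(in_class('é', bits)); // U+00E9, found in the range table
    assert!(!in_class('中', bits));
}
```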