Make DFA match utf8 rather than char
This would improve performance, as no utf8 decoding is necessary.
This is what re2 does too.
A simple way to implement this would be to replace chars in the DFA (and NFA, code generator) with u8. When a char is encoded as multiple bytes introduce new states to match the whole encoding. E.g. if I have this transition
S1 --('ö')--> S2
(state S1 switches to S2 on character 'ö')
This becomes
S1 --(0xC3)--> Sn --(0xB6)--> S2
(state S1 switches to a new intermediate step Sn on byte 0xC3, which then switches to S2 on byte 0xB6)
A problem with this is that the number of states will increase proportional to a matched string/char's UTF-8 encoding, rather than the number of chars. Maybe that's not too bad though, and I don't know if it's possible to avoid new states.
How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to an utf8 DFA, it will be huge. However, will it be larger than the char table we have now?
Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate this for each occurrence (or do we).
How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to an utf8 DFA, it will be huge. However, will it be larger than the char table we have now?
Good point, I think regexes like $$XID_Start will probably make things unacceptably slow in compile time and cause huge code generation.
Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate this for each occurrence (or do we).
Currently we can't really reuse DFAs (or states) as each DFA/state will have a different next state or semantic action (i.e. continuation).
For example, we have "match A then do B" and "match A then do C", the "match A" machines cannot be reused as each machine will do something different after a successful match.
I guess we should keep the DFA and NFA the same and find another way. Do you know how re2 doing this?
Btw, for runtime performance, I think this change will probably be a micro-optimization compared to DFA minimization (#38). We should do that first. It will also reduce code size.