ctpg
Can you explain how to use the lexer stand-alone?
i.e. I want to use it much the way I use http://benhanson.net/lexertl.html (ignore all Unicode etc for now, see the Examples)
There is no official standalone lexer feature. There is a way, but I'm not convinced it should be official in its current state. If you want, you can look under the hood and see how the regex parser implements its lexer.
Or did I completely misunderstand you, and you want to use just the regex lexer without the actual parser? In that case you are out of luck. You could theoretically create one huge regex containing all the lexical tokens, separated with |, like this: (token_1_regex)|(token_2_regex). Unfortunately there is no support for capture groups, so you won't know which sub-regex actually matched. I'm open to implementing this feature, but it would take some time.
Yes, I was hoping to be able to use the lexer without the parser. If your regexes can carry numeric ids, you just store the id in the end state for each regex. You resolve ambiguity by only setting the id in an end state if one has not already been set, so the regex listed first wins. (As you hinted, your lexer generator should OR all the regexes together.)
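To make the end-state id idea concrete, here is a minimal sketch. It is not ctpg or lexertl code: to stay short it builds the DFA as a trie over literal tokens rather than from regexes, and the names `State`, `Dfa`, `add`, `match` and `make_example` are made up for illustration. The point is only the `if (id == 0)` check when a rule's end state is tagged.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// One DFA state: a transition per byte plus the id of the rule accepted here.
struct State {
    std::array<int, 256> next;  // -1 means "no transition"
    std::size_t id = 0;         // 0 means "not an end state"
    State() { next.fill(-1); }
};

struct Dfa {
    std::vector<State> states;
    Dfa() : states(1) {}  // state 0 is the start state

    // Add a literal token (a simplification: a real generator would add a regex).
    // Rules added earlier take priority over later ones.
    void add(std::string_view text, std::size_t id) {
        assert(id != 0 && "0 is reserved for 'not an end state'");
        int s = 0;
        for (unsigned char c : text) {
            if (states[s].next[c] < 0) {
                states[s].next[c] = static_cast<int>(states.size());
                states.emplace_back();
            }
            s = states[s].next[c];
        }
        // Only tag the end state if no earlier rule has claimed it.
        if (states[s].id == 0) states[s].id = id;
    }
};

struct Match {
    std::size_t id;      // 0 if nothing matched
    std::size_t length;  // number of characters consumed
};

// Longest match from the start of `input`.
Match match(const Dfa& dfa, std::string_view input) {
    Match best{0, 0};
    int s = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        s = dfa.states[s].next[static_cast<unsigned char>(input[i])];
        if (s < 0) break;
        if (dfa.states[s].id != 0) best = {dfa.states[s].id, i + 1};
    }
    return best;
}

// Example rule set: rule 3 duplicates rule 1 and therefore never wins.
Dfa make_example() {
    Dfa d;
    d.add("<", 1);
    d.add("<=", 2);
    d.add("<", 3);
    return d;
}
```

With this rule set, matching against `"<=x"` reports id 2 with length 2, and matching against `"<x"` reports id 1, because rule 3 found the end state already tagged.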
I guess I could do a standalone lexer feature; it shouldn't be too hard to expose an interface for that. Like you said, I am already creating a DFA from all the terminal symbols. It would be a lexer class that resembles the parser interface but is simpler: just the terms(...) call is enough. Then a 'match' method would accept the same kinds of arguments (buffer, options, error stream) and return an index and a string view of the matched text.
There are a couple of conventions:
- If there is no match for the lexer at the current position, you usually return a single character (I return an id of ~0 in this case too). This allows you to continue lexing.
- It is customary to return 0 for End of Input.
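A rough sketch of a lexing loop that follows both conventions is below. Everything in it is an assumption for illustration, not an existing ctpg or lexertl interface: `match_at` is a hand-written stand-in for a real DFA (id 1 for a run of digits, id 2 for a run of lowercase letters), and `Token`, `next_token`, `tokenize`, `eoi_id` and `unknown_id` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

constexpr std::size_t eoi_id = 0;                   // End of Input
constexpr std::size_t unknown_id = ~std::size_t{0}; // no rule matched

struct Token {
    std::size_t id;
    std::string_view text;
};

bool is_digit(char c) { return c >= '0' && c <= '9'; }
bool is_lower(char c) { return c >= 'a' && c <= 'z'; }

// Stand-in for a generated DFA. Returns the match length (0 if no rule
// matches at the start of `rest`) and writes the rule id to `id`.
std::size_t match_at(std::string_view rest, std::size_t& id) {
    std::size_t n = 0;
    while (n < rest.size() && is_digit(rest[n])) ++n;
    if (n > 0) { id = 1; return n; }
    while (n < rest.size() && is_lower(rest[n])) ++n;
    if (n > 0) { id = 2; return n; }
    return 0;
}

// Return the token at `pos` and advance `pos` past it.
Token next_token(std::string_view input, std::size_t& pos) {
    if (pos >= input.size()) return {eoi_id, {}};  // convention: 0 at End of Input
    std::string_view rest = input.substr(pos);
    std::size_t id = unknown_id;
    std::size_t len = match_at(rest, id);
    if (len == 0) {
        // Convention: no match, so hand back one character with id ~0.
        id = unknown_id;
        len = 1;
    }
    assert(len > 0 && len <= rest.size());  // guarantees the loop makes progress
    pos += len;
    return {id, rest.substr(0, len)};
}

// Lex the whole input; the last token always has id eoi_id.
std::vector<Token> tokenize(std::string_view input) {
    std::vector<Token> out;
    std::size_t pos = 0;
    do {
        out.push_back(next_token(input, pos));
    } while (out.back().id != eoi_id);
    return out;
}
```

Because an unmatched character is consumed as its own token, the caller can report it and keep lexing instead of aborting at the first unknown byte.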