Can you explain how to use the lexer stand-alone?

Open BenHanson opened this issue 2 years ago • 5 comments

i.e. I want to use it much the way I use http://benhanson.net/lexertl.html (ignore all Unicode etc for now, see the Examples)

BenHanson avatar Dec 27 '21 17:12 BenHanson

There is no official standalone lexer feature. There is a way, but I'm not convinced it should be official in its current state. If you want, you can look under the hood and see how the regex parser implements its lexer.

peter-winter avatar Dec 27 '21 17:12 peter-winter

Or did I completely misunderstand you, and you want to use just the regex lexer without the actual parser? In that case you are out of luck. You could theoretically create one huge regex with all the lexical tokens, separated with |, like this: (token_1_regex)|(token_2_regex), but unfortunately there is no support for capture groups, so you won't know which sub-regex actually matched. I'm open to implementing this feature, though it would take some time.

peter-winter avatar Dec 27 '21 18:12 peter-winter

Yes, I was hoping to be able to use the lexer without the parser. If your regexes can carry numeric ids, you just include the id in the end state for each regex. You resolve ambiguity by only setting the id in an end state if one has not already been set. (As you hinted, your lexer generator should OR all the regexes together.)
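
A minimal sketch of the "first id wins" rule, assuming a DFA state that stores an optional rule id. The names `DfaState`, `mark_end_state` and `merge_end_states` are hypothetical and are not taken from ctpg or lexertl; the sketch only shows how an end state shared by several regexes keeps the id of the rule declared first.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical DFA state: an empty id means "not an end state".
struct DfaState {
    std::optional<std::size_t> id;
};

// Record rule_id as the accepting rule for this state,
// unless an earlier rule has already claimed it.
inline void mark_end_state(DfaState& state, std::size_t rule_id) {
    if (!state.id) {
        state.id = rule_id;
    }
}

// When several regexes end in the same DFA state, visit their ids
// in declaration order so the first declared rule wins.
inline DfaState merge_end_states(const std::vector<std::size_t>& rule_ids_in_order) {
    DfaState state;
    for (std::size_t rule_id : rule_ids_in_order) {
        mark_end_state(state, rule_id);
    }
    return state;
}
```

For example, if a keyword rule with id 1 and an identifier rule with id 2 both accept the same text, the merged end state reports id 1, because the keyword rule was declared first.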

BenHanson avatar Dec 27 '21 21:12 BenHanson

I guess I could do a standalone lexer feature; it shouldn't be too hard to expose an interface for that. Like you said, I am already creating a DFA from all the terminal symbols. It would be a lexer class that resembles the parser interface but is simpler: just the terms(...) call is enough. Then a 'match' method accepting the same kinds of arguments (buffer, options, error stream) and returning an index and a string view of the matched text.
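
To make the shape of such an interface concrete, here is a toy stand-in, not ctpg's API. The names `toy_lexer` and `match_result` are hypothetical, the terms are plain `std::regex` patterns tried one by one rather than a single DFA, and the options and error stream arguments are left out. It picks the longest match at the start of the buffer and lets the earlier term win ties.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result type: which term matched, and the matched text.
struct match_result {
    std::size_t index;
    std::string_view text;
};

// Toy stand-in for the proposed lexer class (an assumption, not ctpg code).
class toy_lexer {
public:
    explicit toy_lexer(const std::vector<std::string>& term_patterns) {
        for (const auto& pattern : term_patterns) {
            terms_.emplace_back(pattern);
        }
    }

    // Longest match anchored at the start of buffer; earlier term wins ties.
    std::optional<match_result> match(std::string_view buffer) const {
        std::optional<match_result> best;
        if (buffer.empty()) {
            return best;
        }
        const char* first = buffer.data();
        const char* last = first + buffer.size();
        for (std::size_t i = 0; i < terms_.size(); ++i) {
            std::cmatch m;
            // match_continuous forces the match to begin at `first`.
            if (std::regex_search(first, last, m, terms_[i],
                                  std::regex_constants::match_continuous)) {
                const auto len = static_cast<std::size_t>(m.length(0));
                if (len > 0 && (!best || len > best->text.size())) {
                    best = match_result{i, buffer.substr(0, len)};
                }
            }
        }
        return best;
    }

private:
    std::vector<std::regex> terms_;
};
```

A caller would construct it with the term patterns, call `match` on the remaining input, and advance by `text.size()` after each successful match.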

peter-winter avatar Dec 28 '21 08:12 peter-winter

There are a couple of conventions:

  • If there is no match for the lexer at the current position, you usually return a single character (I return an id of ~0 in this case too). This allows you to continue lexing.
  • It is customary to return 0 for End of Input.

BenHanson avatar Dec 28 '21 16:12 BenHanson