lime icon indicating copy to clipboard operation
lime copied to clipboard

Tokenizer

Open AlgorithMan-de opened this issue 9 years ago • 12 comments

I wrote a very very cheap lexer/tokenizer generator in BASH to accompany lime (took the manual one from your calc example as reference). Maybe you want to put it in? If you preferred it to be written in php - it should be translatable easily - it mostly consists of echos. The main advantage is that it selects first longest matches, not first matches.

pflex.txt calc.txt

AlgorithMan-de avatar Nov 16 '16 09:11 AlgorithMan-de

Here, I translated it to php pflex.txt

AlgorithMan-de avatar Nov 16 '16 09:11 AlgorithMan-de

With the generated tokenizer, the main calc.php becomes as simple as this calc.txt

AlgorithMan-de avatar Nov 16 '16 14:11 AlgorithMan-de

I don't understand the advantage of selecting the longest match. My own tokenize function evaluates the regular expressions in a logical order without the chance of getting the wrong regex because of different lengths. It is also possible to split functions calls into smaller tokens. E.g. I implemented the syntax IFERROR( exp ; exp ) in the grammar like this: IF ERROR '(' exp ';' exp ')' This way you run only as many regexes as necessary because the first hit is always the right one.

Are there ambiguous regex combinations which would need the longest match? (even with lookaround?)

gene-sis avatar Nov 16 '16 21:11 gene-sis

Specified order would seem to be the more useful approach to me. In PHP, for example, reserved words (if, class) would otherwise be interpreted as function or constant names if the lexer didn't respect the order, I think. The regexes for both would match the same code.

hikari-no-yume avatar Nov 16 '16 21:11 hikari-no-yume

In which "logical order" do you test for variable names and keywords? Variable names first will cause "class", "if", "while" etc. to be counted as variable names. Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword, followed by a variable name. So then you'd have to bake parts of the syntax into the regexps (like "if" is only a keyword if it is followed by a space... or a tab or linefeed or parenthesis... you'll never see the end of these cases.)

AlgorithMan-de avatar Nov 16 '16 21:11 AlgorithMan-de

The issue of “Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword” can be avoided by only matching on word boundaries with \b.

But, yes, come to think of it, simply matching the longest token would be simpler.

hikari-no-yume avatar Nov 17 '16 00:11 hikari-no-yume

You'd still need to handle the case of keywords, though. if would be matched as both an identifier and a keyword. Would the latter take priority if it was specified earlier?

hikari-no-yume avatar Nov 17 '16 00:11 hikari-no-yume

Yes, because it chooses the FIRST longest match - because a regexp is chosen if its token is > than the previous one - not >= . That means it selects the first regexp that has a longest match - later regexps can only be chosen if they give you a longer match. So you only need to put keywords before identifiers. That is the commonly used approach at writing tokenizers (see lex, flex, jflex, etc)

AlgorithMan-de avatar Nov 17 '16 20:11 AlgorithMan-de

Alright, that seems reasonable.

hikari-no-yume avatar Nov 17 '16 20:11 hikari-no-yume

keywords ^((class|if|while)(?![_a-zA-Z])|(else(?=[^_a-zA-Z]|if[^_a-zA-Z]))) comment till the end of line ^(\/\/[\S \r\t\f ]*) inline comment ^(\/\*[\S\s]*?\*\/) variable identifier, operators, parentheses ^([$\.;()+\-\/*=<>!:;]) item name (variable name, constant name, function name) ^([_a-zA-Z]*)

I use a grammar with numbers, {variable name}, functions( exp [; ... ] ) and mathematical operations so I won't need the complexity of the PHP language. I suppose, that there are better and more effective ways of lexing a complex language like PHP.

gene-sis avatar Nov 17 '16 21:11 gene-sis

Selecting the first longest match is the most useful way to go. It precludes a whole bunch of hard-to-debug issues.

I'll look into integrating something easy like this. Thanks!

rvanvelzen avatar Nov 21 '16 14:11 rvanvelzen

combining the two, I have already made a relatively strong parser for a programming language :-)

AlgorithMan-de avatar Nov 24 '16 08:11 AlgorithMan-de