syntax icon indicating copy to clipboard operation
syntax copied to clipboard

Java implemention with bnf does not generate valid Java

Open enebo opened this issue 7 years ago • 7 comments

Trying out the new Java support. My uses will want to use yacc/bison flavor of grammar. Given:

./bin/syntax  -m LALR1 --grammar ../syntax_test/src/com/syntax/calc.bnf -o ../syntax_test/src/com/syntax/Parser.java -w

Here is the .bnf:

%left '+'
%left '*'

%%

exp
  : exp '+' exp
    /* Explicitly calculate location */
    { $$ = new BinaryExpression("+", $1, $3); }

  | exp '*' exp
    /* Use default result location: @$ */
    { $$ = new BinaryExpression("*", $1, $3) }

  | '(' exp ')'
    { $$ = $2 }

  | number
    /* Named args and position */
    { $$ = new NumericLiteral(Integer.parseInt((String) $number)) }
  ;

literal
  : number
  ;

number
  : number digit
    { $$ = (String) $number + (String) $digit; }

  | digit
  ;

digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

Generating this will show two immediate issues:

_lexRule0 has no lhs for semanticValue = null;

  String _lexRule0() {
      .semanticValue = null;
  }

The classes generated need to be inner classes to be valid. Which led me to add the following to the top of the file:

%{
class Parser {

%}

Unfortunately, I could not figure out how to add the closing '}'. In yacc I would solve this with adding:


%%

}

underneath the grammar section but this grammar does not seem to like that.

If I manually add that final '}' then it exposes a second tier issue that there are static method declarations in inner classes which are not marked static.

Not sure how much I should have broken this up for the report. I figure I would dump all I noticed here. Let me know if I can report stuff differently. I am sure once the basics are working I will be able to make more granular reports.

enebo avatar Dec 05 '18 17:12 enebo

@enebo, thanks, I'll take a look. The bad generated code for _lexRule0 is due to the -w option as I identified from the quick tests (I'll fix it). Also, notice, that rules like:

digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

is only for educational purposes (I should probably remove this example), since build unjustified large parsing table. In this case an actual lex rule should be used, as for [0-9], or \d+ (ideally I should probably identify such patterns in BNF rules, and build \d+ automatically for it, worth opening an issue for this).

So your grammar would be:

%lex

%%

\s+       /* skip whitespace */ return null
\d+       return "NUMBER"

/lex

%left '+'
%left '*'

%%

exp
  : exp '+' exp
    { $$ = new BinaryExpression("+", $1, $3); }

  | exp '*' exp
    { $$ = new BinaryExpression("*", $1, $3) }

  | '(' exp ')'
    { $$ = $2 }

  | NUMBER
    { $$ = new NumericLiteral(Integer.parseInt((String) yytext)) }
  ;

Take a look at this example, and also an example of building AST nodes (there is also README.md for Java plugin).

DmitrySoshnikov avatar Dec 06 '18 04:12 DmitrySoshnikov

I also opened #71 to fix the impractical '0' | '1' | '2'... BNF rules (again, on practice they should be moved to lexer; I removed this bad pattern from examples in https://github.com/DmitrySoshnikov/syntax/commit/e96884ca12029144109fdc89c54a4bfa7a4146a7).

DmitrySoshnikov avatar Dec 06 '18 04:12 DmitrySoshnikov

The -w option is fixed in https://github.com/DmitrySoshnikov/syntax/commit/89996debd4e039e114296c4ccfa9f8d2c5958e66, and is available in v0.1.11.

What's up with the inner classes? Can you show an example where it's not working. You may define the AST nodes in another package, and then just import them in the module include section.

E.g. you have com/syntax/ast/BinaryExpression.java, etc, and then in the grammar:

%{

import com.syntax.ast.*;
...

%}

DmitrySoshnikov avatar Dec 06 '18 04:12 DmitrySoshnikov

@DmitrySoshnikov I was just starting with an example from your repo (C# loc was starting point). In the long run I am not very interested in the lexer piece of Syntax since Ruby's is necessarily hand written (unfortunately). I just want some simple example working before I try a larger grammar or compare how well Jay compares with Syntax.

The AST nodes I did move into other files already. It is all the Parser-specific classes generated like YyLoc, Token, ... Those are all dumped into a single file specified by -o and is not valid in Java. I assumed they would be wrapped into some outer class otherwise they all need to be emitted into separate files.

enebo avatar Dec 06 '18 16:12 enebo

@enebo, yeah, if you'll need a separate tokenizer, there is no way automatically provide one for plugins (only JavaScript supports it). You can manually modify the generated file though, removing the default tokenizer from it, and implementing the same API (getNextToken(), hasMoreTokens(), etc). In this case LR parser will be able to work with it.

As for AST node classes, what I meant is not to put them into the module include, but put into a separate package, and require these classes in the module include. In this case external code will be able to require the AST node classes as well.

Also, the single generated parser class contains only one public class. All the other classes are private to it, and it should work. Have you tried end-to-end instructions from the README.md? Let me know if Java can't compile something there. In my tests all the inner classes work fine in Java, and only one public class is exported in the generated code.

DmitrySoshnikov avatar Dec 06 '18 21:12 DmitrySoshnikov

@DmitrySoshnikov I will follow those step by step and let you know but I gave instructions and file with what I used above and they are not much different. I do not see a sizable difference between the two unless the -w did something odd? I will play a little bit...

enebo avatar Dec 06 '18 21:12 enebo

Wacky. I don't get it ... it now works. All the inner class stuff does not seem to be issue. I am fairly confused now. My only thought was I used npm build from master vs actual release. I guess though I have no current issues after 89996de.

@DmitrySoshnikov as to your other suggestion I will try changing the source so I can plug in a lexer. It should be good enough for a test.

I might suggest that a character-based parser in Java should maybe work with a Reader instead of a String, but that may be a future enhancement issue I can open (unfortunately Ruby supports encodings that Java charsets do not support so we do everything with raw bytes (or InputStream)). Decoupling lexer as interface would definitely allow switching between different input types...but again a feature request.

Thanks for your work!

enebo avatar Dec 06 '18 22:12 enebo