gocc
gocc copied to clipboard
Grammar question
Been playing around with gocc and it's a great tool.
I have a few questions regarding how to set up the different lexical elements
I want to be able to parse out these items:
- double quoted strings as string literals eg "hello world"
- single quoted strings as string literals eg 'hello world'
- bare strings that contain alphanumeric + '_' + '.' character as variables eg device.platform_type
I have the following lexical elements defined
_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';
This correctly parses double quoted strings but does not work for single quoted strings or bare strings.
What am I doing wrong?
Full grammar file for context
/* Lexical elements */
_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';
!whitespace : ' ' | '\t' | '\n' | '\r' ;
Expression : Expression "and" Expression << ast.NewLogicalAnd($0, $2) >>
| Expression "or" Expression << ast.NewLogicalOr($0, $2) >>
| "not" Expression << ast.NewNegation($1) >>
| Term
;
Term : Factor ">" Factor << ast.NewComparisonGreaterThan($0, $2) >>
| Factor ">=" Factor << ast.NewComparisonGreaterThanEquals($0, $2) >>
| Factor "<" Factor << ast.NewComparisonLessThan($0, $2) >>
| Factor "<=" Factor << ast.NewComparisonLessThanEquals($0, $2) >>
| Factor "=" Factor << ast.NewComparisonEquals($0, $2) >>
| Factor "==" Factor << ast.NewComparisonEquals($0, $2) >>
| Factor "!=" Factor << ast.NewComparisonNotEquals($0, $2) >>
| Factor "is not" Factor << ast.NewComparisonIsNot($0, $2) >>
| Factor "is" Factor << ast.NewComparisonIs($0, $2) >>
| Factor "contains" Factor << ast.NewComparisonContains($0, $2) >>
| Factor "matches" Factor << ast.NewMatches($0, $2) >>
| Factor
;
Factor : int_lit << ast.NewLiteralInt($0) >>
| float_lit << ast.NewLiteralFloat($0) >>
| variable << ast.NewResolver($0) >>
| string_lit << ast.NewLiteralString($0) >>
| "true" << ast.NewLiteralBool(true) >>
| "false" << ast.NewLiteralBool(false) >>
| "undefined" << nil, nil >>
| "null" << nil, nil >>
| "empty" << nil, nil >>
| "(" Expression ")" << ast.NewClause($1) >>
;
(edited by @mewmew as the BNF grammar was otherwise distorted by Markdown formatting, resulting in panic: Error decoding rune. Lit: '', rune: 39, size1
from Gocc as '_'
was replaced with ''
using Markdown formatting)
There doesn't look to be anything wrong with the lexical part of your grammar.
The following simplified grammar is capable of lexing double quoted string literals, single quoted string literals, and identifiers.
/* Lexical elements */
_digit : '0'-'9' ;
_letter : 'a'-'z' | 'A'-'Z' ;
_alphanumeric : _letter | _digit ;
variable : _letter { _alphanumeric | '.' | '_' } ;
int_lit : _digit {_digit} ;
float_lit : _digit { '.' | _digit } ;
string_lit : '"' {.} '"' | '\'' {.} '\'';
!whitespace : ' ' | '\t' | '\n' | '\r' ;
Factor
: variable
| string_lit
;
The following Go program (available at play.golang.org) was used to test the lexer and parser.
package main
import (
"fmt"
"log"
"play/b/lexer"
"play/b/parser"
)
func main() {
inputs := []string{
`"hello world"`,
`'hello world'`,
`hello_world`,
}
for _, input := range inputs {
if err := parse(input); err != nil {
log.Fatal(err)
}
}
}
func parse(input string) error {
l := lexer.NewLexer([]byte(input))
p := parser.NewParser()
result, err := p.Parse(l)
if err != nil {
return err
}
fmt.Printf("%#v\n", result)
return nil
}
The above program gives the following output for the simplifed BNF grammar.
&token.Token{Type:3, Lit:[]uint8{0x22, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x22}, Pos:token.Pos{Offset:0, Line:1, Column:15}}
&token.Token{Type:3, Lit:[]uint8{0x27, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x27}, Pos:token.Pos{Offset:0, Line:1, Column:15}}
&token.Token{Type:2, Lit:[]uint8{0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x5f, 0x77, 0x6f, 0x72, 0x6c, 0x64}, Pos:token.Pos{Offset:0, Line:1, Column:13}}
Three tokens from two distinct token types (string_lit and variable) have been lexed successfully. Which version of Gocc are you running?
Finally got some time to look at this again.
I found out what was confusing me. The string_lit is parsed out correctly but the surrounding quotes are included in the string. All my tests were assuming the quote would be stripped.
Curious, is there a way to structure the grammar so the surrounding quotes are removed?
Thanks for your reply this is a fantastic library! 💃
Curious, is there a way to structure the grammar so the surrounding quotes are removed?
Take a look at the astx example.
From astx/ast.bnf:
Stmt :
id << ast.NewStmt($0) >>
;
From astx/ast/ast.go:
func NewStmt(stmtList interface{}) (Stmt, error) {
return Stmt(stmtList.(*token.Token).Lit), nil
}
You may easily adapt the code within the semantic action expressions (i.e. within <<
and >>
) to unquote the string literals.
Something along the lines of:
String:
string_lit << ast.NewStringLit($0) >>
;
func NewStringLit(lit interface{}) (string, error) {
return strconv.Unquote(string(lit.(*token.Token).Lit)), nil
}
More questions:
Does the string "empty" have significant meaning in gocc? When I try parsing the string empty
I get the following error.
Unable to build AST: Error in S0: empty(15,empty), Pos(offset=0, line=1, column=7), expected one of: $ and or not > >= < <= = == != is contains matches variable string_lit int_lit float_lit null true false undefined (
If I change the parser to use the string "dempty" instead my unit tests pass.
Based on the error message above, it appears "empty" is being removed from the symbol list.
Factor
: variable << ast.NewResolver($0) >>
| string_lit << ast.NewLiteralString($0) >>
| int_lit << ast.NewLiteralInt($0) >>
| float_lit << ast.NewLiteralFloat($0) >>
| "empty" << ast.NewNull() >>
| "null" << ast.NewNull() >>
| "true" << ast.NewLiteralBool(true) >>
| "false" << ast.NewLiteralBool(false) >>
| "undefined" << ast.NewNull() >>
| "(" Expression ")" << ast.NewClause($1) >>
;
empty
is a reserved keyword for the empty string.
From https://en.wikipedia.org/wiki/Empty_string:
In formal language theory, the empty string is the unique string of length zero.
And from the Gocc User Guide:
A syntax body may also be empty, indicated by the reserved word,
empty
. This allows you to write productions with optional elements, such as:A : B "c"; B : "b" | empty;
I'd recommend reading the Gocc User Guide going forwards, it is a treasure trove of wonderful information about formal language theory and grammars.
Note, the reserved keyword empty
should be used in production rules without any surrounding quotes. That is:
// valid
A : empty ;
// invalid
B : "empty" ;
Thanks for your response, I have reviewed the PDF but I was confused about the meaning. I understood that the bare string empty was a reserved word but I thought "empty" would be treated differently. I.e. "empty" and empty would have different meanings.
I am trying to build a Go implementation of a python library that has already defined a grammar that has attached meaning to the string "empty". It appears Gocc might not be able to support this.
Also today I was trying to add support for "is" and a "is not" terms. The parser appears to be confusing the "is not" term with "not" Expression.
Expression : Expression "and" Expression << ast.NewLogicalAnd($0, $2) >>
| Expression "or" Expression << ast.NewLogicalOr($0, $2) >>
| "not" Expression << ast.NewNegation($1) >>
| Term
;
Term : Factor ">" Factor << ast.NewComparisonGreaterThan($0, $2) >>
| Factor ">=" Factor << ast.NewComparisonGreaterThanEquals($0, $2) >>
| Factor "<" Factor << ast.NewComparisonLessThan($0, $2) >>
| Factor "<=" Factor << ast.NewComparisonLessThanEquals($0, $2) >>
| Factor "=" Factor << ast.NewComparisonEquals($0, $2) >>
| Factor "==" Factor << ast.NewComparisonEquals($0, $2) >>
| Factor "!=" Factor << ast.NewComparisonNotEquals($0, $2) >>
| Factor "is" Factor << ast.NewComparisonIs($0, $2) >>
| Factor "is not" Factor << ast.NewComparisonIsNot($0, $2) >>
| Factor "contains" Factor << ast.NewComparisonContains($0, $2) >>
| Factor "matches" Factor << ast.NewMatches($0, $2) >>
| Factor
;
Factor
: variable << ast.NewResolver($0) >>
| string_lit << ast.NewLiteralString($0) >>
| int_lit << ast.NewLiteralInt($0) >>
| float_lit << ast.NewLiteralFloat($0) >>
| "empty" << ast.NewNull() >>
| "null" << ast.NewNull() >>
| "true" << ast.NewLiteralBool(true) >>
| "false" << ast.NewLiteralBool(false) >>
| "undefined" << ast.NewNull() >>
| "(" Expression ")" << ast.NewClause($1) >>
;
Why not have
Factor "is" "not" Factor << ast.NewComparisonIsNot($0, $2) >>
I am still thinking about empty
Have you tried putting empty in the lexer as
emptyword : 'e' 'm' 'p' 't' 'y' ;
And then use that in the grammar
Factor
...
| emptyword << ast.NewNull() >>
...
;
I would like to know whether that also breaks?
Why not have
Factor "is" "not" Factor << ast.NewComparisonIsNot($0, $2) >>
Another option would be to use
Factor "is" Expression << ast.NewComparison($0, $2) >>
Since Expression
already includes "not" Expression
among others.
@mewmew Wouldn't that give possibility to have a parse conflict with the AST:
Factor "is" Expression
/ | \
Term1 "is" Expression "and" Expression
/ | \
Term2 "and" Term3
and
Expression "and" Expression
/ | \
Factor "is" Expression "and" Term3
/ | \
Term1 "is" Term2
? The not in "is not" is not the same as the not in "not" Expression.