gocc icon indicating copy to clipboard operation
gocc copied to clipboard

Error generating Gocc using Gocc

Open mewmew opened this issue 8 years ago • 6 comments

I get the following errors when trying to generate Gocc from the gocc2.bnf grammar, using the gen.sh script.

$ ./gen.sh
warning: undefined symbol "g_sdt_lit" used in productions ["FileHeader" "SyntaxBody" "SyntaxBody"]
warning: undefined symbol "prodId" used in productions ["SyntaxProduction" "Symbol"]
warning: undefined symbol "ignoredTokId" used in productions ["LexProduction"]
warning: undefined symbol "char_lit" used in productions ["LexTerm" "LexTerm" "LexTerm"]
warning: undefined symbol "string_lit" used in productions ["Symbol"]
warning: undefined symbol "tokId" used in productions ["LexProduction" "Symbol"]
warning: undefined symbol "regDefId" used in productions ["LexProduction" "LexTerm"]

The gocc2.bnf grammar contains the following TODO which seems related.

/*** TODO: ***
1. Handle reserved words correctly so that user cannot write reserved words in his grammar. E.g.: string_lit, prodId, etc.
***/

It is possible that the grammar has been left in an incomplete state, in wait for Gocc version 3. As mentioned in the readme (gocc3 user guide will be published shortly). Any news with regards to Gocc 3? What is the primary changes from version 2 to 3? Has the grammar been updated?

I'd love to dive into the Gocc code base and get thoroughly accustomed to it. Just want to make sure, I don't spend too much time reading code which has been made obsolte by a yet to be released version 3.

Cheers /u

mewmew avatar Feb 08 '16 18:02 mewmew

gocc3 has not been worked on for a few years. gocc3 would have been able to generate gocc2. me and goccmack are quite busy with other things at the moment. So gocc is in maintenance mode. We still heavily use gocc, but I don't see gocc3 happening anytime soon. I use it for my new project and my team internally uses it for about 4 projects.

I think "gocc3 user guide will be published shortly" should be removed from the manual.

@goccmack would have to comment on what the primary changes would have been between gocc2 and gocc3. I can't remember.

I think you can dive into the gocc code base confident that big changes are not going to happen anytime soon and that gocc is here to stay.

awalterschulze avatar Feb 09 '16 08:02 awalterschulze

gocc3 included two major changes:

  1. It supported an EBNF for input and
  2. It supported David Pager's "Practical General Method" for generating LR(1) parsers as an option to the Knuth LR(1) algorithm.

I am of the opinion that EBNF encourages grammars which are outside LR(1). Most EBNF grammars I wrote had lots of LR(1) conflicts. I would advise against adding EBNF to gocc again.

When I have the time again I would like to add Pager's PGM again because it generates smaller tables for full LR(1) grammars -- similar in size to LALR for the same grammar. I also think that the original paper is beautiful and that it is a pity that LALR became popular instead of this technique. The PGM parse table generator was working in gocc3 but not extensively tested.

I recalled gocc3 because of errors in the translation of EBNF to BNF for parser generation, which I didn't consider worth fixing because I peferred to revert to BNF input.

goccmack avatar Feb 09 '16 11:02 goccmack

Hi Walter and Marius,

Thanks for providing me with some background regarding Gocc 3. From what I understood, version 3 added EBNF which turned out to encourage grammars outside of LR(1), and is therefore not intended for inclusion in any future release of Gocc.

It is good that you mentioned about Pager's PGM method, and that it generates tables within the same range as LALR, because I was thinking of working on adding LALR support for Gocc in the future. Maybe, I'll read Pager's paper instead. As you already have experience with the algorithm, it would probably be easier for you to add support for it to Gocc 2, but I'd still be happy to learn as I go :)

I understand if both of you are busy, so I just wanted to ask one thing. How much work would you estimate is required to patch Gocc, or the grammar of Gocc 2 to be able to generate itself once more?

Is the current master branch considered to be Gocc 2? The the code for Gocc 3 ever appear on the public repository?

Cheers /u

mewmew avatar Feb 09 '16 13:02 mewmew

Hi Robin,

The errors you got when you ran gocc on its own BNF were for the lexical elements which are not declared the grammar. gocc2 still uses a hand written scanner. If you look at the gocc BNF you will see that the lexical definitions are commented out.

If I recall correctly I made gocc3 from gocc2 in about 3 weeks, but that included the PGM state machine generator and major refactoring of the package structure and AST. I also refactored the parse table generation so that I could use it for both Knuth and PGM state machines.

gocc3 is still in the repo. The last release candidate is labelled "gocc3 rc3". SHA Id: 1e6bdfb2732c34734542e1539519c0ef970912e8 I think the last commit of gocc3 before I reverted to gocc2 is sha id: f5de52b995e3af6722e3829010d81fea6e2c44b1

To modify gocc to compile itself would require first of all to modify the BNF to contain the lexical elements used by the syntax part. Then modify main.go to use the generated lexer instead of the current scanner. There were one or two issues with name clashes between lexical elements in the target grammar and those used in gocc2 for generating the new gocc. You may have to rename things like char_lit to char_literal. I don't remember the details anymore but it wasn't very difficult.

The PGM parser generator is in the gocc3 code base. If you look in package parser/lrk you will find knuth and pgm packages, which generate the respective state machines. Parser generation is in package parser/lrk/codegen. The reason for the elaborate package structure is that I prepared to add support for parser generation in other languages too. parser/lrk/codegen is where that would plug in.

If you want to tackle this project I suggest you start by reading both codebases to understand the differences and then work on gocc3. Remove the gocc3 EBNF rewriting and use the gocc2 BNF for input. By the end of the project you will understand LR(1) code generation very well!

Pager's paper isn't freely available. Here's the reference: A Practical General Method for Constructing LR(k) Parsers David Pager Acta Informatica 1977 vol 7 pp 249-268

It was published about a year after Knuths paper on LR(1) and a bit before the first LALR paper. I find PGM much easier to understand than most LALR descriptions. I bought the paper for the price of a small African farm from Springer who owns Acta Informatica's archives.

Pager had a PhD student who evaluated his two LR(k) table compression methods against LALR and newer methods LR(2) methods. Perhaps I can send you more detail in a private email.

Regards, Marius

goccmack avatar Feb 09 '16 13:02 goccmack

Oh and Robin,

I assume you will fix the lexer backtracking as part of such project:)

goccmack avatar Feb 09 '16 14:02 goccmack

llir/llvm@4c184db was supposed to reference #33 not #13. Sorry for the noise.

mewmew avatar Dec 03 '16 18:12 mewmew