bnfc icon indicating copy to clipboard operation
bnfc copied to clipboard

Lexer sometimes is not reversible

Open safinaskar opened this issue 3 years ago • 6 comments

I'm using BNFC 2.9.1. Consider this .cf file:

EAB. Exp ::= AB;
EA_B. Exp ::= A B;

token A 'a';
token B 'b';
token AB 'a' ' ' 'b';

Generate haskell code from it and call this from main:

pExp $ myLexer $ printTree (EA_B (A "a") (B "b"))

You will get Right (EAB (AB "a b")). I. e. your lexer is not reversible. I think it is hard to make lexer reversible in this pathological case. But at least BNFC should print some error to warn user that something is wrong

safinaskar avatar Apr 24 '21 18:04 safinaskar

Well, it is standard that a lexer (or regular expression matcher) takes the longest match, so "a b" will always match AB. This usually intended, so if you have

token KwWhere "where"
token Id letter+

then whereas is never lexed as KwWhere followed by identifier as, but always as identifier whereas.

andreasabel avatar Apr 25 '21 16:04 andreasabel

@andreasabel , hi, thanks a lot for answer. It seems you don't understand me. My point is this: BNFC should guarantee that printing is always reversible. In other words, BNFC should guarantee that for any AST "x" the following statement is true:

"if isRight $ parser $ lexer $ printTree x, then (parser $ lexer $ printTree x) == Right x"

If BNFC unable to provide such guarantee, then BNFC should give error instead of producing .y files.

I want reliable tool. Such that I can simply write input file for BNFC and be sure that BNFC will check everything for me

safinaskar avatar Apr 25 '21 17:04 safinaskar

Ok, I see, Askar, you want what is often called a round-trip property giving you correctness of your parser and printer.

I think you cannot get this property in the form you stated, without any constraints on x. For instance, if x is an identifier that clashes with a keyword, you can print that, but not parse it again.

Consider this BNFC grammar:

Prg.  Program ::= "return" Ident;

Then Prg "return" is a perfectly well-formed syntax tree for Program, but it will print as return return which gives you a parse error.

A typical round-trip property is (writing parse for the combination of lexing and parsing, and print for printTree):

If x = parse text is not an error, then parse (print x) = x.

A bit slopply said, parse . print . parse = parse.

In the OP, if I take a b (one space) as input text, it will parse as EAB (because of longest match) and then satisfy the round-trip property. However, if I take a  b (two spaces), it will parse as EA_B, print as a b, parse as EAB and thus violate the round-trip property.

One way to exclude your counterexample would be to forbid whitespace in tokens.

But there are other obstacles to the round-trip property.

  • For instance, the printer does not always separate tokens by whitespace. Concretely, it does not put a space before a semicolon, which is often a token (e.g. in C/Java). So having a keyword that includes a final semicolon, like foobar;, would produce a clash with identifier foobar followed by semicolon.

  • The grammar could be ambiguous. A famous example is the dangling else.

    SExp.    Stm ::= Exp ";";
    SIf.     Stm ::= "if" "(" Exp ")" Stm;
    SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm;
    EInt.    Exp ::= Integer;
    

    The tree

      SIfElse (EInt 0)
       (SIf (EInt 1) (SExp (EInt 2)))
       (SExp (EInt 3))
    

    will print as

    if (0) if (1) 2; else 3;
    

    and then parse as

      SIf (EInt 0)  
        (SIfElse (EInt 1) (SExp (EInt 2)) (SExp (EInt 3))
    

    violating the round-trip property. (To be fair, for this grammar you actually get a shift-reduce conflict warning from the parser generator.)

All in all, lexer, parser, and printer need to be carefully developed to ensure the correctness property. This is likely a few years R&D time, and I would do this in Agda rather than Haskell, in order to prove the round-trip property. And, I would not target lexer and parser generators, but output code directly a parser, so I can reason about its behavior in Agda...

Anyway, do you have a concrete list of situations you want to be warned about by BNFC?

andreasabel avatar Apr 25 '21 19:04 andreasabel

@andreasabel

For instance, if x is an identifier that clashes with a keyword, you can print that, but not parse it again.

I agree. For this reason printing should be partial function (something like a -> Maybe b). Such change should be done anyway, because successful printing of such AST is, of course, bug.

I think printing and parsing should be functions with such signatures:

print :: AST -> Maybe String
parse :: String -> Maybe AST

And the following should be true: a. Result of parsing should be printable, but printing not necessary should give same string. I. e. if "parse s == Just x", then "print x /= Nothing". (But not necessary "print x == Just s".) b. Parsing of result of printing should give same representation. I. e. if "print x == Just s", then "parse s == Just x"

Of course, this is just sketch, you may have "Either SomeError String" instead of "Maybe String", etc.

Currently I am searching "perfect parsing solution" for my needs, I often need to parse something. I often use Haskell. I need something like BNFC, i. e. a tool which takes complete description of a language and produces both parser and printer. (Not necessary pretty printer, just a printer). I described my problem here: https://mail.haskell.org/pipermail/haskell-cafe/2021-January/133275.html , I recommend you reading that, I give a lot of links to other projects.

It seems BNFC currently doesn't provide reversibility, so I will continue searching. If I found nothing, I will write such solution.

I already spent a lot of time thinking how to do this, so I am nearly sure this is possible.

But there are other obstacles to the round-trip property

All they can be dealt with. By putting additional restrictions on the language, which should not be too restrictive in practice.

So having a keyword that includes a final semicolon, like foobar;, would produce a clash with identifier foobar followed by semicolon

Yes. For this reason BNFC should find all pairs of token types which should be always separated by whitespace (in your example identifier and semicolon will be such pair). Such pairs can be easily found using some library for dealing with regular languages, i. e. library which can determine whether two given regular languages intersect. For Haskell such library is http://hackage.haskell.org/package/kleene . (But, of course, you can manually implement needed algorithms.)

In fact, I already started to write my parsing solution. I already wrote lexer using "kleene". The next step will be converting list of tokens back to text. I am sure I will get reversibility thanks to rejecting of "wrong" languages using "kleene".

I can show you code if you want.

The grammar could be ambiguous

Yes. We should check ambiguity and reject ambiguous grammars. This is possibly the hardest part, because checking for ambiguity for arbitrary CFG is impossible on Turing machine. But we can restrict ourselves to some subset of set of all CFGs (see my letter https://mail.haskell.org/pipermail/haskell-cafe/2021-January/133275.html for options). Or we can just try to find ambiguity using brute force and (if grammar seems to be non-ambiguous) parse input using some method which can detect ambiguities in parse time, for example, Earley algorithm. So we will find most ambiguities at moment of invocation of BNFC and will find remaining ones at moment of actual parse. This is method I plan to use in my solution.

A famous example is the dangling else

Yes. So we should either force user to manually rewrite this grammar to be non-ambiguous (this may grow its size x2) or give user some tools to disambiguate it (bison has such constructs).

All in all, lexer, parser, and printer need to be carefully developed to ensure the correctness property

Yes

This is likely a few years R&D time

I am more optimistic. :) I already did some research. I think it will take me 1 month to finish my project. I already started, as I said.

I already have plan for my project.

First, I will write reversible lexer. This could be done by carefully checking properties of regular languages of tokens using "kleene".

Then reversible parser. I will probably use some existing implementation of Earley algorithm. Good property of this algorithm is this: it gives all possible parse trees, so we can detect ambiguity in parse time, if we overlooked it before. I will use brute force in parser generation time combined with additional checking for ambiguity in parse time, as I described.

The following stage is converting raw output of parser (parse tree) to AST and vice versa. I didn't think about this a lot. But I think this is possible using something like this: http://hackage.haskell.org/package/invertible-0.2.0.7/docs/Data-Invertible-TH.html .

As additional guard I plan to check my implementation using https://hackage.haskell.org/package/QuickCheck . I. e. just test that reversibility equations using QuickCheck.

Anyway, do you have a concrete list of situations you want to be warned about by BNFC?

I want full reversibility. Okey, I understand you. I will continue my search and probably will write my own solution.

Also. https://www.brics.dk/xsugar/ does reversible transformation between arbitrary language and XML. Xsugar even proves non-ambiguity for CFG. So, Xsugar proves that this is possible. Xsugar uses single file to describe both lexer and parser, Xsugar's descriptions are similar to BNFC's. The reason why I don't want to use Xsugar is this: Xsugar's lexer is not flexible, in particular it is not even possible to make Xsugar's lexer to correctly parse JSON strings.

Also, https://github.com/boomerang-lang/boomerang does reversible transformation between two languages. So, it proves possibility of such solution, too. The problem is this: boomerang does not produce AST, it produces text (as well as I understand, I am currently investigating this project). So, I will have to convert target language, say, to s-expression and then parse this s-expression in my Haskell code. Still, boomerang is reasonable, and possibly I will stick to it

safinaskar avatar Apr 25 '21 20:04 safinaskar

Thanks for the pointers.

Other works that you could look at verified parsing:

  • A Verified LL(1) Parser Generator, https://dblp.org/rec/conf/itp/LasserCFR19
  • TRX: A Formally Verified Parser Interpreter, https://arxiv.org/abs/1105.2576
print :: AST -> Maybe String
parse :: String -> Maybe AST

My take on this would be a more refined version of AST that allows a total printing function AST -> String. So I would use dependent types to ensure that AST is something that satisfies the necessary conditions (like no keyword/identifier clash).

Unfortunately, atm I do not have time to work on the grand plan of a verified parser generator. My goals for BNFC since I took it over are more modest: If BNFC accepts a grammar, then the parser generation does not crash and produces compilable code. This is still not true in all cases, I did work on this in #235, #278, #289...

andreasabel avatar Apr 26 '21 08:04 andreasabel

@andreasabel , I wrote library I sketched here in https://github.com/BNFC/bnfc/issues/360#issuecomment-826387556 . Here it is: https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html

safinaskar avatar Jul 10 '21 23:07 safinaskar