Complain about Lark

Open erezsh opened this issue 4 years ago • 19 comments

Do you have any problems with Lark that aren't necessarily bugs?

Is there anything about the design that you don't like? Or the interface? Don't like how issues or releases are handled?

This is the place to complain!

It's okay if there are no practical solutions either. I'm just curious to hear.

erezsh avatar Nov 17 '19 10:11 erezsh

My complaint about Lark is that I cannot use it when programming in C, C++, Java, JavaScript, etc. A possible solution would be to generate a LALR parser in C/C++, Java, JavaScript, etc., which can be imported and used by these applications.

However, while programming in Python, I have never generated a parser. I just pip install lark-parser and import it directly into my application. When programming in other languages, I would have to change my workflow: as part of the build process, I would have to call some script that runs a Python program to generate a self-contained Lark parser. This would be nice because I could still install lark with pip install lark-parser and use it from any other language.

This would also be portable to old GCC compilers such as 4.4, because on systems such as Debian 7 the available Python interpreter is Python 2, and lark supports Python 2. However, lark would have to generate C code compatible with the C99 standard and C++ code compatible with the C++98 standard.

I just wonder how the Python grammar would be exportable to C++, as it uses a python class directly for the indentation: https://github.com/lark-parser/lark/blob/54027942515054682a2958d7a7570a162311c177/examples/python_parser.py#L15-L25

That is also something I do not understand well in lark. Looking at ANTLR grammars, I see that they run Java code directly to calculate the Python indentation in their Python grammar: https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4#L36-L54

...
tokens { INDENT, DEDENT }

@lexer::members {
  // A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
  private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
  // The stack that keeps track of the indentation level.
  private java.util.Stack<Integer> indents = new java.util.Stack<>();
  // The amount of opened braces, brackets and parenthesis.
  private int opened = 0;
  // The most recently produced token.
  private Token lastToken = null;
  ...

  @Override
  public Token nextToken() {
    // Check if the end-of-file is ahead and there are still some DEDENTS expected.
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
      // Remove any trailing EOF tokens from our buffer.
      for (int i = tokens.size() - 1; i >= 0; i--) {
        if (tokens.get(i).getType() == EOF) {
          tokens.remove(i);
        }
      }
      ...

Can lark grammars run Python like ANTLR runs Java code in their grammar files?

evandrocoan avatar Nov 17 '19 16:11 evandrocoan

A possible solution would be generate a LALR parser in C/C++, Java, Javascript, etc,

Absolutely. That is exactly what I did here: https://github.com/erezsh/Lark_Julia

Of course, that still requires writing a lot of code in the target language.

Indenters would have to be written in the target language, and applied when loading the parser.

Can lark grammars run Python like ANTLR runs Java code in their grammar files?

No, by design. It would be fairly easy to add, but it would also make grammars very difficult to read, and make them unportable. It's also very rare that this is actually required. Lark's Transformer class takes care of most cases in which it would be used (And postlexing takes care of many of the rest).
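As a rough illustration of the post-lexing idea, here is a minimal sketch in plain Python of the kind of indentation tracking that the ANTLR snippet embeds in the grammar. It is not Lark's own indenter: the function name `indent_tokens` and the `(type, value)` token tuples are made up for this example, and it assumes indentation is written with spaces only.

```python
def indent_tokens(lines):
    """Yield (type, value) tokens, adding INDENT/DEDENT around indented blocks.

    Hypothetical helper for illustration; assumes space-only indentation.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in lines:
        if not line.strip():
            continue  # blank lines do not affect block structure
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent to column {width}")
        yield ("LINE", line.strip())
    # Close any blocks that are still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)


tokens = list(indent_tokens(["if x:", "    y", "z"]))
```

Because this step lives outside the grammar, the grammar itself stays declarative, and a port to another language only has to reimplement this one small function.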

P.S. Technically none of this is a complaint, so I'm very disappointed!

erezsh avatar Nov 17 '19 17:11 erezsh

I think they are, from my understanding: [image]

evandrocoan avatar Nov 17 '19 18:11 evandrocoan

Not to make a thing out of it, but your only complaint was that Lark doesn't work outside of Python, which is a bit like complaining that Python doesn't run on Javascript. The rest was more a request for information. But there's no need to split hairs, I was just being silly.

erezsh avatar Nov 17 '19 18:11 erezsh

But there's no need to split hairs, I was just being silly.

Me too 🙂

evandrocoan avatar Nov 17 '19 18:11 evandrocoan

It is difficult to determine whether one has written an ambiguous grammar. Or, put another way: is it possible, given a Lark Earley parser, to write a function that takes the parser and returns an input string that, when fed into the parser, produces a tree with an _ambig node, or None if no such string exists?

jnwatson avatar Nov 26 '19 15:11 jnwatson

is it possible, given a Lark earley parser, to write a function that takes the parser and returns an input string, when fed into the parser, produces a tree with an _ambig node, or None if no such strings exist.

Can you give an example of how this would work (supposing/pretending lark had already implemented it)?

evandrocoan avatar Nov 26 '19 22:11 evandrocoan

is it possible, given a Lark earley parser, to write a function that takes the parser and returns an input string, when fed into the parser, produces a tree with an _ambig node, or None if no such strings exist.

Can you give an example about how this would work (supposing/pretending lark already had implemented it)?

import lark

def find_ambiguity(parser):
    '''
    Returns a string, given a Parser, that when parsed with parser,
    would return a tree with an _ambig node (if ambiguity is explicit)
    '''
    # TODO: draw rest of owl
    return '1'

grammar = r'''
    start: LETTERS | NUMBERS
    LETTERS: /\w+/
    NUMBERS: /\d+/
'''
parser = lark.Lark(grammar, ambiguity='explicit')
ambigstr = find_ambiguity(parser)

ambigtree = parser.parse(ambigstr)
assert '_ambig' in str(ambigtree)

jnwatson avatar Nov 26 '19 22:11 jnwatson

See paper for one strategy. It doesn't look easy by any means.
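For very small grammars, a bounded brute-force search is one way to fill in part of the owl. The sketch below is pure Python and does not use Lark: the grammar is a plain dict (a made-up representation for this example), terminals are literal strings, and it assumes there are no empty productions and no cycles of unit productions. Since ambiguity of context-free grammars is undecidable in general, a `None` result only means "no ambiguous string up to `max_len`", not "unambiguous".

```python
from functools import lru_cache
from itertools import product


def count_parses(grammar, start, text):
    """Count distinct derivation trees of `text` from `start`.

    `grammar` maps a nonterminal to a list of productions (tuples of symbols);
    any symbol that is not a key is treated as a literal terminal.
    """
    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        if symbol not in grammar:  # terminal: must equal the slice exactly
            return 1 if text[i:j] == symbol else 0
        return sum(seq(prod, i, j) for prod in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(symbols, i, j):
        if len(symbols) == 1:
            return sym(symbols[0], i, j)
        head, rest = symbols[0], symbols[1:]
        # Every symbol consumes at least one character (no empty productions).
        return sum(sym(head, i, k) * seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return sym(start, 0, len(text))


def find_ambiguity(grammar, start, alphabet, max_len):
    """Return the first string (shortest first) with more than one parse, else None."""
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            if count_parses(grammar, start, text) > 1:
                return text
    return None


ambiguous = {"expr": [("expr", "+", "expr"), ("1",)]}
layered = {"expr": [("expr", "+", "term"), ("term",)], "term": [("1",)]}
print(find_ambiguity(ambiguous, "expr", "1+", 5))  # 1+1+1
print(find_ambiguity(layered, "expr", "1+", 5))    # None
```

The same enumeration could presumably drive a real parser instead, by parsing each candidate with ambiguity='explicit' as in the snippet above and checking for an _ambig node, but the search space grows exponentially with the string length either way.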

jnwatson avatar Nov 26 '19 23:11 jnwatson

The error messages / debugging output are only ok--and the #1 thing I'd look for in a parser is GREAT error messages. You do consistently point to the character where there is a problem, but don't consistently explain what went wrong. BTW I'm quite a new user (installed this 20 minutes ago), but somewhat familiar with parsers and EBNF already.

Why it's important

Most people using a pre-existing grammar are also using a pre-existing parser, so you should expect your users to be making a new grammar. Most people writing a grammar are doing it for the first time, and it's a fairly difficult task, so I think really good error messages are important.

There are two situations I know of where lark prints errors: parsing the grammar, and running the parser. Of the two, running the parser is more important (and harder) to give good errors for. (I realize multiple backends probably make that a pain.)

One additional thing to consider here is, a lot of the projects using lark are just going to directly print lark errors, so it's probably important to be even more user-friendly than you'd expect.

Specific examples

Here are some specific errors I'm hitting within the first 20 minutes of trying out lark. I don't think it's that important to fix these specifically, just trying to explain ok vs great errors.

  • When parsing an EBNF grammar, I put '.' or '-' into a terminal name (only '_' works). The error message is lark.exceptions.GrammarError: Unexpected colon at line 3 column 14. A clearer message would be 'BLANK.PAGE' is not a valid terminal name because it contains '.'. (side note: https://lark-parser.readthedocs.io/en/latest/grammar/ contains an example name with '.' in it)
  • When parsing with my grammar, I see something like Expecting: {'__ANON_0'}, and a pointer to line 1, character 1. What is _ANON_0? I've never defined this (I assume this is a parenthesized expression in some rule? but I don't think most people would guess that). How much of a tree has it constructed, what rule is it on?
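To make the first example concrete, here is a minimal sketch of a name check that produces the suggested message. It is not taken from Lark: the helper name `explain_terminal_name` is hypothetical, and the naming rule encoded in the regex (optional leading underscore, then uppercase letters, digits and underscores) is an assumption about what counts as a valid terminal name.

```python
import re

# Assumed naming rule, for illustration only: optional leading underscore,
# an uppercase letter, then uppercase letters, digits or underscores.
TERMINAL_NAME = re.compile(r"_?[A-Z][_A-Z0-9]*")


def explain_terminal_name(name):
    """Return None if `name` looks valid, else a human-readable explanation."""
    if TERMINAL_NAME.fullmatch(name):
        return None
    # Collect the characters that can never appear in a name.
    bad = sorted({ch for ch in name if not (ch.isalnum() or ch == "_")})
    if bad:
        chars = ", ".join(repr(ch) for ch in bad)
        return f"{name!r} is not a valid terminal name because it contains {chars}"
    return f"{name!r} is not a valid terminal name (expected something like MY_TERMINAL)"


print(explain_terminal_name("BLANK.PAGE"))
# 'BLANK.PAGE' is not a valid terminal name because it contains '.'
```

A check like this would have to run before the generic "Unexpected colon" error is raised, which is the harder part: the grammar loader needs to guess that the user was trying to write a name.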

za3k avatar May 04 '20 23:05 za3k

@za3k Thanks for your input!

I agree with everything you said. Good error messages are really helpful, especially to beginners, but they are also very difficult.

Part of the difficulty is that the best errors have some understanding of what the user was probably trying to accomplish. That requires some understanding of common usage and idiomatic expectations, that a parser cannot know.

So, the exceptions Lark throws are meant for the developers, not the users. And I agree that Lark has a lot to improve there. However, how to improve it isn't obvious, to me at least. But I might take a closer look at it in the near future.

Meanwhile, I made a small commit that might help debug lalr grammars when debug mode is on. It basically just prints the state stack, which might give you a clue (https://github.com/lark-parser/lark/commit/c56112eea39dcee34ec1509aef508665f558ba89)

Regarding _ANON_0, it's an unnamed token. And yes, it's not the best thing to display, just the easiest.

erezsh avatar May 12 '20 14:05 erezsh

@za3k You can avoid the _ANON_0 by explicitly naming all your tokens. Lark has built-in names for the single-character tokens, but anything else, you'll need to come up with your own name.

jnwatson avatar May 12 '20 16:05 jnwatson

Dumping the state stack seems like a helpful idea, I wouldn't have thought of it. Thanks for making the change. Is debug mode on by default? I would have no idea how to turn it on if not, so I don't think that will help most people. You'd need to

  • Have it on by default
  • OR put it prominently in the documentation. I'd suggest "to turn on more detailed debugging information..." in all-but-whitelisted exceptions as a good place to document it.

So having used lark a little more (I used it a couple days, then gave up and wrote a special-purpose parser, which was way easier to maintain and debug), I can actually say I ran into the _ANON problem several times, and it might be worth fixing specifically after all.

Also since I didn't say before, thanks for making this nonspecific thread and caring about user feedback! It's a good idea, I should steal it for my projects. I want people to tell me why my design and goals are wrong :)

za3k avatar May 17 '20 06:05 za3k

@za3k Thanks for taking the time to describe your experience! I try to make Lark as simple and accessible as possible (without taking away from it), but that's not an easy task in this domain.

FWIW, You can turn on debug with Lark(..., debug=True).

It might be nice to have a FAQ section, or similar, for all the small and common things beginners might need. Like "what does _ANON mean" or "why doesn't my lookback regex work?", etc. I'll give it some thought!

erezsh avatar May 17 '20 07:05 erezsh

I doubt this will be of any help whatsoever to anyone, but I just thought I'd mention why I decided to use Lark (that is, a context free grammar and parser--none of this is Lark-specific) at all, and then why I stopped.

Before grabbing Lark, I had a hand parser, but it was a bit jumbled, and the logic for parsing and for doing things with the parsed content was all mixed up. I thought "aha, this will be easier to work with if I write the format declaratively". The only way I know to write a(n arbitrary) format declaratively is to use a grammar.

But the declarative result was not easy to read, easy to debug, or even easy to make describe what I wanted. One problem with "easy to describe what I wanted" was that I wanted "split by..." functionality. For example, I am parsing pages, and each one is separated by the phrase "--page". This is super easy to do with regex split or even fixed-string split. I'd like to be able to take this "top down" approach, where I first get pages, then parse each page. Writing a (single) unambiguous grammar in which "--page" cannot be included in any part of the page is really hard. Now that regexes have negative lookaround you can do it a bit more easily, but the result is ugly as hell to read. If my separators weren't fixed strings, it would have been nearly impossible.
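For reference, the "top down" version of this is only a few lines with regex split. The sketch below assumes a separator is a whole line starting with "--page"; `split_pages` and `parse_page` are hypothetical names, and the per-page parser is a placeholder.

```python
import re

# Assumption: a separator is a line that starts with "--page".
PAGE_SEPARATOR = re.compile(r"^--page.*$\n?", flags=re.MULTILINE)


def split_pages(text):
    """Split a document into pages on separator lines, dropping empty pages."""
    return [page.strip("\n") for page in PAGE_SEPARATOR.split(text) if page.strip()]


def parse_page(page):
    """Placeholder per-page parser: here just the list of non-empty lines."""
    return [line for line in page.splitlines() if line.strip()]


document = "title\nintro\n--page\nfirst\n--page 2\nsecond\n"
pages = split_pages(document)
parsed = [parse_page(page) for page in pages]
print(parsed)  # [['title', 'intro'], ['first'], ['second']]
```

Each page could then be handed to whatever parser suits it, including a grammar-based one, without that grammar needing to know about the separator at all.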

In an effort to add "split by", I first wrote a (small) parser-generator where I could default to smallest-match rather than largest-first which made ambiguous grammars default to the option I wanted (and IIRC with explicit "split by", because I don't think that always works). It was still really verbose and hard to follow as a human reader.

Then, I threw out the entire approach and wrote the entire thing in terms of matching and splitting by regex (my hand parser also did this, but I switched away from being declarative). This was way more readable.

I also split apart a checker/linter (makes sure everything is in the exact right format) from a parser (parses that content assuming it's in the right format), which simplified the parser enormously.

Based on this and a couple related experiences, I get the impression that we need a declarative way to talk about formats that does not match how context-free grammars are classically defined (maybe similarly to how modern regular expressions are much better than just |, concatenation, and '*'; maybe another tool entirely). This is certainly not the only time someone has thought "aha, I want to write a declarative spec for my format" and then gotten bogged down in grammars.

za3k avatar May 17 '20 21:05 za3k

@za3k Context-free grammars are definitely not good for everything, and they can have difficulties when non-annotated free text is involved.

Usually I find that using Lark serves me better than what I could write manually, and on a wide range of parsing tasks. But I also have a lot of intuition on how to use it, since I wrote it and thought a lot about it, which is very hard to communicate.

Still, sometimes regex really is the best tool for the job. Or, at least, the easiest.

erezsh avatar May 17 '20 21:05 erezsh

I feel like Lark has potential to help find ambiguities in grammars. Specifically, you could set parser='earley' and ambiguity='explicit', then see what _ambig trees are produced for your various test cases. However, the existence of #536 means you could have ambiguities that are not detected.

charles-esterbrook avatar Nov 12 '20 01:11 charles-esterbrook

I noticed that Lark has hard-coded names for certain one-character literals that also appear in its error messages. This was quite surprising to me and at first I didn't understand what was going on. In fact, after having checked out the simple hello world example from the README and playing around with it, I ended up with an error like

lark.exceptions.UnexpectedEOF: Unexpected end-of-input. Expected one of: 
        * BANG

which at first glance gave me the impression that this library doesn't have error messages whatsoever and just tells you "BANG - something went wrong, go figure it out for yourself". This is obviously in part on me because I am used to reading Python exceptions in a bottom-up fashion, stopping as soon as I think I have the information that I need.

However, while I think having named terminals in error messages for non-printable characters can be quite neat, I think it would be more clear if they were instead named explicitly. E.g. the above error message reading something like Expected '!' or Expected the character '!'.
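A minimal sketch of what that could look like, as a standalone formatter rather than a change to Lark itself. The function `describe_expected` is hypothetical, and the name table is an illustrative subset: only BANG is confirmed by the error message above, the other entries are assumptions.

```python
# Illustrative subset; only BANG is confirmed by the error message in this thread,
# the other names are assumptions made for the example.
LITERALS = {"BANG": "!", "COMMA": ",", "DOT": ".", "LPAR": "(", "RPAR": ")"}


def describe_expected(names):
    """Render expected terminal names, showing known one-character literals as-is."""
    shown = [repr(LITERALS[n]) if n in LITERALS else n for n in sorted(names)]
    if len(shown) == 1:
        return f"Expected {shown[0]}"
    return "Expected one of: " + ", ".join(shown)


print(describe_expected({"BANG"}))           # Expected '!'
print(describe_expected({"NAME", "COMMA"}))  # Expected one of: ',', NAME
```

User-defined terminals keep their names in this scheme, so only the built-in single-character names are translated back to the character the user actually typed in the grammar.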

This is just a minor thing I've stumbled upon though - now that I know that named terminals are a thing I can fully appreciate that I have finally found a Python parser generator that can do proper error handling ❤️

Krzmbrzl avatar Nov 24 '23 09:11 Krzmbrzl

which at first glance gave me the impression that this library doesn't have error messages whatsoever and just tells you "BANG - something went wrong, go figure it out for yourself"

I'm sorry, but that made me laugh out loud 😆

It's a good point, these error messages can be improved. I'm a little swamped lately, but maybe someone can take it upon themselves, or I'll get to it at some point. Thanks for your input!

erezsh avatar Nov 24 '23 09:11 erezsh