lark icon indicating copy to clipboard operation
lark copied to clipboard

Reimplementation of end symbol (Issue #237)

Open erezsh opened this issue 4 years ago • 24 comments
trafficstars

erezsh avatar Apr 18 '21 03:04 erezsh

I don't remember the problem, only that it messed some grammars up. But I changed the LALR logic since, made it a little cleaner, so maybe that was solved on the way?

I'll make sure to do thorough tests before merging it in.

erezsh avatar Apr 18 '21 14:04 erezsh

I'm trying to test this right now. What's the terminal to use? $END?

ThatXliner avatar Apr 18 '21 19:04 ThatXliner

No, It is just $

MegaIng avatar Apr 18 '21 19:04 MegaIng

Maybe I'm going too far but is this supposed to work with custom Indenters? Because I'm facing a problem where

lark.exceptions.UnexpectedToken: Unexpected token Token('__$END$__', '') at line 1, column 1.

ThatXliner avatar Apr 18 '21 20:04 ThatXliner

It is supposed to work. Can you give a short example script?

MegaIng avatar Apr 18 '21 21:04 MegaIng

on_error was broken before, but now I fixed it. So if you were using it, you should try again with the latest commit.

erezsh avatar Apr 18 '21 23:04 erezsh

Now it's just really slow. Or maybe it's just my grammar....

EDIT: Ok, never mind. Apparently, it was kinda hanging when I did

(value (_NL|$|_DEDENT)+ )

but removing _DEDENT works...

(I'm making a programming language-type thing. It's whitespace sensitive)

SECOND EDIT:

So I'm trying to make an error and there we go:

Traceback (most recent call last):
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 127, in feed_token
    action, arg = states[state][token.type]
KeyError: 'IDENTIFIER'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/bryanhu/projects/langs/tylaireum/tylaireum/__main__.py", line 79, in <module>
    print(parser.grammar.parse(INLINE).pretty())
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/lark.py", line 552, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parser_frontends.py", line 107, in parse
    return self.parser.parse(stream, start, **kw)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 42, in parse
    return self.parser.parse(lexer, start)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 176, in parse
    return self.parse_from_state(parser_state)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 193, in parse_from_state
    raise e
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 184, in parse_from_state
    state.feed_token(token)
  File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 130, in feed_token
    raise UnexpectedToken(token, expected, state=self, interactive_parser=None)
^Clark.exceptions.UnexpectedToken: <exception str() failed>

But as you can see, I had to CTRL-C to stop it. It was just hanging

I am too lazy to send a minimal reproducible example right now, so if you want my grammar, etc, I'll send the whole enchilada.

ThatXliner avatar Apr 20 '21 00:04 ThatXliner

@ThatXliner Thanks for finding that. I think I fixed it, so try now.

erezsh avatar Apr 20 '21 01:04 erezsh

nope.

EDIT: Though I noticed it wasn't hanging for valid input this time

ThatXliner avatar Apr 20 '21 02:04 ThatXliner

@ThatXliner Ok then, hit me with your enchilada. But make sure it's easy to see it fail.

erezsh avatar Apr 20 '21 02:04 erezsh

Probably it was hanging for valid input when I sent my last comment but I most likely didn't double-check. Now, when I make it valid, it hangs.

Here's the enchilada: https://gist.github.com/ThatXliner/96829ce180b5659198a822cf74b6c309

Tell me if you can reproduce or not

ThatXliner avatar Apr 20 '21 04:04 ThatXliner

I found a minimal breaking example:

from lark import Lark

p = Lark("""
    ?start: "foo" x "bar"
    x: $ x
""", parser="lalr")

p.parse("foo")

The infinite loop happens because $ is accepted forever, but has no valid route to unwind the stack.

I need to think about how to solve this. My current intuition is that this grammar shouldn't even be allowed. But I'm not sure how easy it is to detect reliably.

But maybe I just need to rethink the loop mechanism, and solve it by only providing the end token once.

erezsh avatar Apr 20 '21 15:04 erezsh

I'm wondering how you're implementing the end token for right now. I think it's better to parse normally, get an EOF error, see if the grammar expects it and if so, continue parsing. Unless that's what you're doing right now.

ThatXliner avatar Apr 20 '21 16:04 ThatXliner

So, it seems that this can be "solved" by testing that the parser only shifts once on END, and otherwise throwing a run time error for infinite loop.

But I don't like it. Lark shouldn't allow to compile a grammar that can never run to completion. But, I'm not sure how to detect it at compile time.

erezsh avatar Apr 20 '21 16:04 erezsh

Hi Do you have plan this PR to be merged into the master soon? Regards!

seimit avatar Jun 01 '21 17:06 seimit

@seimit There are issues with this PR, so currently no.

The problem is that in some edge cases, the parser gets stuck in a loop forever. Which is something I never want Lark to be able to do.

erezsh avatar Jun 01 '21 17:06 erezsh

@erezsh Shouldn't it be possible to add some early point reject all BNF rules that have a Terminal following an $ symbol? E.g. if we find a rule of the form

a: $ A

we completely throw it away, since it can never actually match.

For rules of the form

a: $ b

We need to check if b can be empty, and if it can not be empty we throw the rule away. I think we check during Grammar Analyzing anyway if a rule can be empty.

MegaIng avatar Sep 30 '21 12:09 MegaIng

It's a good point that we can detect it during grammar analysis. And then we can just throw an error if anything follows the $END symbol. I don't think there's a good enough reason to allow them, empty rules or not.

erezsh avatar Sep 30 '21 13:09 erezsh

Yes there is. ($|_NEWLINE)* should be allowed. I think that would break if we don't allow it.

MegaIng avatar Sep 30 '21 13:09 MegaIng

OTOH, maybe it shouldn't be allowed. In effect it is the same as _NEWLINE* $?. I am not sure. But we need to make sure that for example the python grammar works when everywhere were we accept a statement ending _NEWLINE we also accept a $.

But also, can we somehow detect this:

a: end A
end: $

?

MegaIng avatar Sep 30 '21 13:09 MegaIng

If we do it at the grammar analysis level, we can make sure FOLLOWS[$] is empty. That should take care of all edge cases (I think)

erezsh avatar Sep 30 '21 14:09 erezsh

Hmm actually, no, that's the crux of the problem, which I guess is why I left it open.

On one hand, the grammar has to allow the option of something after an end token, otherwise, there's no advantage in implementing it. On the other hand, it's not obvious how to block the bad uses while allowing the good ones.

erezsh avatar Sep 30 '21 14:09 erezsh

The point is more that there is that it is possible to accept nothing after $. Not sure how that can be expressed.

MegaIng avatar Sep 30 '21 15:09 MegaIng

Throwing in my $0.02, just to say that this feature would be useful to me.

Erotemic avatar Jul 27 '23 17:07 Erotemic