Reimplementation of end symbol (Issue #237)
I don't remember the problem, only that it messed some grammars up. But I changed the LALR logic since, made it a little cleaner, so maybe that was solved on the way?
I'll make sure to do thorough tests before merging it in.
I'm trying to test this right now. What's the terminal to use? $END?
No, it's just $
Maybe I'm going too far but is this supposed to work with custom Indenters? Because I'm facing a problem where
lark.exceptions.UnexpectedToken: Unexpected token Token('__$END$__', '') at line 1, column 1.
It is supposed to work. Can you give a short example script?
on_error was broken before, but now I fixed it. So if you were using it, you should try again with the latest commit.
Now it's just really slow. Or maybe it's just my grammar....
EDIT: Ok, never mind. Apparently, it was kinda hanging when I did
(value (_NL|$|_DEDENT)+ )
but removing _DEDENT works...
(I'm making a programming language-type thing. It's whitespace sensitive)
SECOND EDIT:
So I'm trying to make an error and there we go:
Traceback (most recent call last):
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 127, in feed_token
action, arg = states[state][token.type]
KeyError: 'IDENTIFIER'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/bryanhu/.pyenv/versions/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/bryanhu/projects/langs/tylaireum/tylaireum/__main__.py", line 79, in <module>
print(parser.grammar.parse(INLINE).pretty())
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/lark.py", line 552, in parse
return self.parser.parse(text, start=start, on_error=on_error)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parser_frontends.py", line 107, in parse
return self.parser.parse(stream, start, **kw)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 42, in parse
return self.parser.parse(lexer, start)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 176, in parse
return self.parse_from_state(parser_state)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 193, in parse_from_state
raise e
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 184, in parse_from_state
state.feed_token(token)
File "/Users/bryanhu/projects/langs/tylaireum/lark/lark/parsers/lalr_parser.py", line 130, in feed_token
raise UnexpectedToken(token, expected, state=self, interactive_parser=None)
^Clark.exceptions.UnexpectedToken: <exception str() failed>
But as you can see, I had to Ctrl-C to stop it. It was just hanging.
I am too lazy to send a minimal reproducible example right now, so if you want my grammar, etc, I'll send the whole enchilada.
@ThatXliner Thanks for finding that. I think I fixed it, so try now.
nope.
EDIT: Though I noticed it wasn't hanging for valid input this time
@ThatXliner Ok then, hit me with your enchilada. But make sure it's easy to see it fail.
It was probably hanging for valid input when I sent my last comment too, but I most likely didn't double-check. Now, when I make the input valid, it hangs.
Here's the enchilada: https://gist.github.com/ThatXliner/96829ce180b5659198a822cf74b6c309
Tell me if you can reproduce or not
I found a minimal breaking example:
from lark import Lark
p = Lark("""
?start: "foo" x "bar"
x: $ x
""", parser="lalr")
p.parse("foo")
The infinite loop happens because $ is accepted forever, but has no valid route to unwind the stack.
I need to think about how to solve this. My current intuition is that this grammar shouldn't even be allowed. But I'm not sure how easy it is to detect reliably.
But maybe I just need to rethink the loop mechanism, and solve it by only providing the end token once.
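To illustrate why the loop happens, and why providing the end token only once would break it, here is a toy, self-contained sketch. It is not Lark's actual parser driver; the names and the `drive` function are purely illustrative.

```python
from itertools import chain, repeat

END = "$END"

def drive(token_stream, limit=100):
    """Toy stand-in for the parser loop: for a grammar like `x: $ x`,
    the only action on END is to shift and ask for the next token, so
    the loop only ends when the stream itself runs out."""
    shifts = 0
    for tok in token_stream:
        if tok == END:
            shifts += 1          # shift END, stay in the same state
            if shifts > limit:
                return "hung"    # stack grows forever, nothing unwinds it
    return "finished"

# A lexer that answers every request with another END reproduces the hang:
assert drive(repeat(END)) == "hung"
# Emitting END exactly once lets the driver run to completion:
assert drive(chain(["FOO"], [END])) == "finished"
```

The point of the sketch: the hang is a property of the token supply, not of any single parse step, which is why "only provide the end token once" is a plausible fix.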
I'm wondering how you're implementing the end token right now. I think it's better to parse normally, get an EOF error, check whether the grammar expects the end token, and if so, continue parsing. Unless that's what you're doing already.
So, it seems that this can be "solved" by checking that the parser only shifts once on END, and otherwise throwing a runtime error for an infinite loop.
But I don't like it. Lark shouldn't allow compiling a grammar that can never run to completion. But I'm not sure how to detect that at compile time.
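One way to sketch that runtime guard (illustrative only, not Lark's API): wrap the token-feeding step and raise as soon as END would be shifted a second time.

```python
END = "$END"

class GrammarLoopError(RuntimeError):
    """Raised when the parser would shift the end token more than once."""

def guard_end(feed_token):
    """Wrap a feed_token-style callable so END may only be shifted once."""
    end_shifts = 0
    def guarded(tok):
        nonlocal end_shifts
        if tok == END:
            end_shifts += 1
            if end_shifts > 1:
                raise GrammarLoopError("grammar accepts $END forever")
        return feed_token(tok)
    return guarded

# Usage with a dummy feed function that just logs tokens:
log = []
feed = guard_end(log.append)
feed("FOO")
feed(END)            # first END is fine
try:
    feed(END)        # second END triggers the guard
except GrammarLoopError:
    log.append("guarded")
assert log == ["FOO", END, "guarded"]
```

This trades a hang for an error, which matches the "solved" in scare quotes above: it doesn't reject the bad grammar at compile time, it just refuses to loop.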
Hi, do you have a plan for merging this PR into master soon? Regards!
@seimit There are issues with this PR, so currently no.
The problem is that in some edge cases, the parser gets stuck in a loop forever. Which is something I never want Lark to be able to do.
@erezsh Shouldn't it be possible, at some early point, to reject all BNF rules that have a terminal following a $ symbol? E.g. if we find a rule of the form
a: $ A
we completely throw it away, since it can never actually match.
For rules of the form
a: $ b
We need to check if b can be empty, and if it cannot be empty we throw the rule away. I think we already check during grammar analysis whether a rule can be empty.
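A hedged, self-contained sketch of that filter, on a toy grammar representation (a dict mapping each nonterminal to a list of symbol tuples; anything that isn't a key counts as a terminal). These are not Lark's actual data structures: compute which nonterminals can be empty, then drop any alternative where a non-nullable symbol follows $.

```python
END = "$"

def nullable_set(grammar):
    """Fixpoint: a nonterminal is nullable if one alternative is all-nullable."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            if nt not in nullable and any(
                all(s in nullable for s in alt) for alt in alts
            ):
                nullable.add(nt)
                changed = True
    return nullable

def drop_dead_end_rules(grammar):
    """Remove alternatives where anything non-nullable follows $."""
    nullable = nullable_set(grammar)
    def alive(alt):
        seen_end = False
        for sym in alt:
            if seen_end and sym not in nullable:
                return False     # something must come after end-of-input: dead
            if sym == END:
                seen_end = True
        return True
    return {nt: [a for a in alts if alive(a)] for nt, alts in grammar.items()}

# `a: $ A` is dead, `a: $ b` survives because b can be empty:
g = {"a": [(END, "A"), (END, "b")], "b": [()]}
assert drop_dead_end_rules(g)["a"] == [(END, "b")]
```

Note this direct scan only sees a $ that appears literally in the rule; it misses indirection like `end: $`, which is the harder case discussed below.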
It's a good point that we can detect it during grammar analysis. And then we can just throw an error if anything follows the $END symbol. I don't think there's a good enough reason to allow them, empty rules or not.
Yes, there is: ($|_NEWLINE)* should be allowed. I think that would break if we don't allow it.
OTOH, maybe it shouldn't be allowed. In effect it is the same as _NEWLINE* $?. I am not sure. But we need to make sure that, for example, the Python grammar works: everywhere we accept a statement-ending _NEWLINE, we must also accept a $.
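A quick sanity check of that claimed equivalence, restricted to well-formed token streams (where the end marker can appear at most once, and only last). In this sketch E stands for $ and N for _NEWLINE; the regex patterns are illustrative stand-ins for the two grammar fragments.

```python
import re

# ($|_NEWLINE)*  vs  _NEWLINE* $?  over symbolic streams: E = $, N = _NEWLINE
pat_star = re.compile(r"(?:E|N)*\Z")
pat_opt = re.compile(r"N*E?\Z")

# Well-formed streams: any number of newlines, then at most one end marker.
streams = ["", "N", "NN", "E", "NE", "NNE"]
for s in streams:
    assert bool(pat_star.match(s)) == bool(pat_opt.match(s))

# The two only differ on streams that can never occur, e.g. E followed by N:
assert pat_star.match("EN") and not pat_opt.match("EN")
```

So on every stream the lexer can actually produce, `($|_NEWLINE)*` and `_NEWLINE* $?` accept the same inputs, which supports the "in effect it is the same" reading above.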
But also, can we somehow detect this:
a: end A
end: $
?
If we do it at the grammar analysis level, we can make sure FOLLOWS[$] is empty. That should take care of all edge cases (I think)
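A hedged sketch of that analysis, again on a toy grammar representation (dict of nonterminal to list of symbol tuples; non-keys are terminals — none of this is Lark's actual grammar-analysis code): compute nullable and FIRST sets, then propagate FOLLOW and check that FOLLOW of the end terminal is empty. Unlike the direct rule scan, this catches the indirect `a: end A` / `end: $` case above.

```python
END = "$"

def nullable_set(g):
    """Fixpoint: a nonterminal is nullable if some alternative is all-nullable."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for nt, alts in g.items():
            if nt not in nullable and any(
                all(s in nullable for s in alt) for alt in alts
            ):
                nullable.add(nt)
                changed = True
    return nullable

def first_sets(g, nullable):
    """FIRST[nt]: terminals that can begin a derivation of nt."""
    first = {nt: set() for nt in g}
    changed = True
    while changed:
        changed = False
        for nt, alts in g.items():
            for alt in alts:
                for sym in alt:
                    add = first[sym] if sym in g else {sym}
                    if add - first[nt]:
                        first[nt] |= add
                        changed = True
                    if sym not in nullable:
                        break
    return first

def follow_of_end(g):
    """FOLLOW of the end terminal: anything in here means the grammar
    allows input to continue after end-of-input, which can never match."""
    nullable = nullable_set(g)
    first = first_sets(g, nullable)
    follow = {nt: set() for nt in g}
    follow[END] = set()           # track FOLLOW of the end terminal too
    changed = True
    while changed:
        changed = False
        for nt, alts in g.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in follow:
                        continue  # ordinary terminal: FOLLOW not tracked
                    acc, tail_nullable = set(), True
                    for nxt in alt[i + 1:]:
                        acc |= first[nxt] if nxt in g else {nxt}
                        if nxt not in nullable:
                            tail_nullable = False
                            break
                    if tail_nullable:
                        acc |= follow[nt]
                    if acc - follow[sym]:
                        follow[sym] |= acc
                        changed = True
    return follow[END]

# Indirect bad case from the comment above: `a: end A`, `end: $`.
bad = {"start": [("a",)], "a": [("end", "A")], "end": [(END,)]}
assert follow_of_end(bad) == {"A"}
# A benign grammar where $ is genuinely last:
good = {"start": [("FOO", "x")], "x": [(END,), ()]}
assert follow_of_end(good) == set()
```

The check would then be: reject (or warn on) any grammar where `follow_of_end` returns a non-empty set.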
Hmm actually, no, that's the crux of the problem, which I guess is why I left it open.
On one hand, the grammar has to allow the option of something after an end token, otherwise, there's no advantage in implementing it. On the other hand, it's not obvious how to block the bad uses while allowing the good ones.
The point is more that it must remain possible to accept nothing after $. I'm not sure how that can be expressed.
Throwing in my $0.02, just to say that this feature would be useful to me.