
Indenter doesn't work intuitively with ignored tokens

Open ColonelThirtyTwo opened this issue 3 years ago • 9 comments

Describe the bug

Indenter is cited as the way to parse whitespace-sensitive languages, but it has unintuitive and obtrusive behavior when there is an ignored token (e.g. a comment) in the middle of a newline sequence.

To Reproduce

Example:

GRAMMAR = r"""
	%import common.WS_INLINE
	%import common(C_COMMENT, CPP_COMMENT, SIGNED_NUMBER)

	%declare _INDENT _DEDENT
	%ignore WS_INLINE
	%ignore CPP_COMMENT
	%ignore C_COMMENT
	_NL: /(\r?\n[\t ]*)+/

	VARNAME: /[_a-zA-Z][_a-zA-Z0-9\/]*/

	start: [_NL] stmt*
	stmt: VARNAME _NL [_INDENT stmt+ _DEDENT]
"""

from lark import Lark
from lark.indenter import Indenter

class TreeIndenter(Indenter):
	NL_type = '_NL'
	OPEN_PAREN_types = []
	CLOSE_PAREN_types = []
	INDENT_type = '_INDENT'
	DEDENT_type = '_DEDENT'
	tab_len = 4

parser = Lark(GRAMMAR, postlex=TreeIndenter(), parser="lalr", start="start")
data = """
foo
bar
	/* */
	baz
"""
for tok in parser.lex(data):
	print(repr(tok))
print(parser.parse(data).pretty())

Remove the /* */ from the data variable and it parses ok.

This happens because the ignored comment splits the newline sequence into two separate _NL tokens, and the indenter does not coalesce them.

ColonelThirtyTwo avatar Mar 24 '21 20:03 ColonelThirtyTwo

The recommended solution is to capture comments in the newline.

Something like:

_NL: (NEWLINE | COMMENT)+

If you have a better way, we'll be happy to hear it.

erezsh avatar Mar 24 '21 20:03 erezsh

@erezsh Thanks for the quick response. Looks like that works.

Might be worth putting that in the docs somewhere - seems like a gotcha. I think that the Indenter could be altered to coalesce adjacent _NL tokens too, so that the alteration isn't needed.

ColonelThirtyTwo avatar Mar 24 '21 20:03 ColonelThirtyTwo

I think you're right, it might be possible to do so. It would prevent languages in which a blank space is a dedent, but perhaps those aren't very common.

Anyway, in terms of performance, the current solution works best.

We can keep this issue open, while I consider the best option. If the code remains the same, I agree we should mention this somewhere in the docs.

erezsh avatar Mar 24 '21 20:03 erezsh

@erezsh It might in general be worth considering creating a FAQs/best practices/common misunderstanding page.

MegaIng avatar Mar 24 '21 21:03 MegaIng

@MegaIng What would you put there? It doesn't seem like there are a lot of repetitive issues, since most of them are solved in the code.

erezsh avatar Mar 24 '21 22:03 erezsh

Not necessarily repetitive (right now), but tricks that can't really be fixed and should be documented somewhere. Of course, we can just keep them in issues, but I think adding a page in the docs for them is worth it. #857 (which is actually a 'duplicate' of #517), #841, this issue, #838, #833, etc. (+ stuff from gitter, which is even less searchable than github issues). Most of them were just answered with a short text, explaining what is going on and how to fix the grammar. These could all just be formulated into a "FAQ" page. If you don't want this it's fine, but I think it is worth it.

MegaIng avatar Mar 24 '21 22:03 MegaIng

@MegaIng I have no objection. If you want to write such a page, I'll add it.

Maybe some of those can fit in https://lark-parser.readthedocs.io/en/latest/how_to_use.html or https://lark-parser.readthedocs.io/en/latest/recipes.html

erezsh avatar Mar 24 '21 23:03 erezsh

@erezsh

The recommended solution is to capture comments in the newline.

Something like:

_NL: (NEWLINE | COMMENT)+

That doesn't seem to work in my case where the comments are SH_COMMENTS because the comment eats the \n before NEWLINE sees it.

Any suggestions for a grammar that

  • uses SH or Python style comments
  • uses indentation for structure and therefore uses Indenter

julie777 avatar Apr 05 '22 00:04 julie777

@julie777 That's what the official Python grammar does: https://github.com/lark-parser/lark/blob/master/lark/grammars/python.lark

erezsh avatar Apr 05 '22 07:04 erezsh