Locally disable %ignore

Open Kyuuhachi opened this issue 3 years ago • 15 comments

It is common for languages to have sections, such as strings or floats, whose lexical syntax (w.r.t. whitespace and such) is completely different from everything else's.

The current approach appears to be to define long, unwieldy regexes that capture the entire string as one token, as in python3.lark. This is difficult to understand and requires heavy postprocessing. It also only works if the captured section's syntax is regular, which makes string interpolation difficult to parse. (I think CPython works around this by parsing out the {} sections in a second pass, which heavily restricts nesting.) One possible workaround would be using four different regexes (quote-to-quote, quote-to-splice, splice-to-splice, and splice-to-quote), but this is not a satisfactory solution.
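
For illustration, here is a heavily simplified, runnable toy version of that approach (this is not the actual python3.lark terminal, just a grammar in the same spirit):

from lark import Lark

# Toy version of the "one big regex" approach: the whole string literal,
# escapes included, is matched as a single STRING token, so escapes (and any
# interpolation syntax) still have to be decoded in a post-processing step.
grammar = r'''
    start: (STRING | NAME)+
    STRING: /"(\\.|[^"\\])*"/
    NAME: /[a-z]+/
    %ignore /\s+/
'''

parser = Lark(grammar)
print(parser.parse(r'foo "a \"quoted\" part" bar').pretty())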

It is of course possible to define a string as

string: "\""  strchar* "\""
strchar: /[^"\\]/
       | "\\\"" -> escape_quote
       | "\\\\" -> escape_backslash

But that allows %ignore rules to apply inside the string, which is very much unwanted. Would it be possible to add some kind of notation to prohibit %ignore inside particular sections? Some options would be surrounding a set of rules with pragmas such as %ignore_off/%ignore_on or %protect/%unprotect, or adding a prefix for space-sensitive rules, such as ~string: ….
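
For example, this minimal, runnable demonstration (with strchar simplified to exclude spaces, so the effect is unambiguous) shows the global %ignore reaching inside the string:

from lark import Lark

# Because %ignore is global, the space inside the quotes is silently dropped
# instead of being kept as string content.
grammar = r'''
    start: string
    string: "\"" strchar* "\""
    strchar: /[^"\\ ]/
    %ignore " "
'''

parser = Lark(grammar)
print(parser.parse('"a b"').pretty())  # the space between a and b disappears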

One thing that needs to be considered, if this were to be added, is how to deal with post-lexers' always_accept and similar features. Most use cases I can think of are similar to %ignore and should thus be restricted, but there might be cases I haven't thought of.

Another consideration is that instead of just a blanket "stop all ignore rules", there could be some kind of more fine-grained control. It's not unthinkable that one could want a whitespace-sensitive language containing embedded sections of Lisp or something, which has a different set of ignore rules.

Kyuuhachi avatar Jul 23 '20 01:07 Kyuuhachi

I can see how more fine-grained control of %ignore might be useful.

I can suggest adding the following syntax:

%ignore WS, COMMENT {
  // list of rules

  %ignore COMMENT2, -WS {
    // list of rules, ignoring only COMMENT and COMMENT2
  }
}

I don't think always_accept would be a big hindrance. Actually, adding a 'scoped ignore' feature might be a way to get rid of that hack.

However, that's all in theory. In practice, I'm not sure 'scoped ignore' would be an easy feature to add.

But I'll be happy to accept such a PR, and help in any way I can.

erezsh avatar Jul 23 '20 07:07 erezsh

I just took a look at how ignores are implemented, and yeah, it seems rather complicated to add. It appears that the ignore set is handled entirely in the lexer (plus something in xearley that I don't quite understand), which might be workable with the contextual lexer — I'm not quite clear on its capabilities. Essentially, I think what would need to be done is to associate a set of ignored tokens with each parser state rather than with the grammar as a whole. Another option would be to transparently rewrite all rules from expr: expr "+" expr to expr: expr _IGNORE* "+" _IGNORE* expr, but that would probably hurt performance, if not something worse.
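
For concreteness, here is that rewrite spelled out by hand for a single toy rule (a real implementation would have to do this mechanically for every rule in the grammar):

from lark import Lark

# Whitespace is threaded explicitly through the rule via a filtered terminal,
# instead of relying on a global %ignore.
grammar = r'''
    start: expr
    expr: expr _IGNORE* "+" _IGNORE* expr
        | NUMBER
    NUMBER: /[0-9]+/
    _IGNORE: / +/
'''

parser = Lark(grammar)
print(parser.parse("1 + 2").pretty())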

I don't quite like that proposed syntax, since it would essentially require 80% of the grammar to be enclosed in those braces. I suppose the globally scoped, braceless syntax would still be available (for backward compatibility if nothing else), but in that case you'd essentially have to duplicate the ignore spec so that each ignored terminal can be explicitly subtracted for the string rules.

always_accept might be a hack, but having a way to tell the lexer that a token can appear (mostly) anywhere but still be accessible to the post-lexer is very useful at times (indentation-sensitive languages, mostly).
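
For reference, this is roughly the pattern I mean, adapted from Lark's indented-tree example: the Indenter postlexer lists _NL in its always_accept so the lexer always emits newline tokens, and the postlexer turns the indentation into _INDENT/_DEDENT tokens:

from lark import Lark
from lark.indenter import Indenter

grammar = r'''
    ?start: _NL* tree
    tree: NAME _NL [_INDENT tree+ _DEDENT]

    NAME: /\w+/
    _NL: /(\r?\n[\t ]*)+/

    %declare _INDENT _DEDENT
    %ignore /[\t ]+/
'''

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

parser = Lark(grammar, parser='lalr', postlex=TreeIndenter())
print(parser.parse("a\n  b\n  c\n").pretty())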

Kyuuhachi avatar Jul 23 '20 11:07 Kyuuhachi

Well, one approach might be to assign an "ignore set" to each rule, and then, in the contextual lexer (or in Earley), ignore according to the union of those sets over the rules currently being observed (which can be precomputed per state).
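
Roughly, with made-up data structures (not the real code):

# For each parser state, take the union of the ignore sets of every rule that
# has an item in that state; the lexer then uses the precomputed set for the
# current state.
def ignore_sets_per_state(state_rules, rule_ignore):
    """state_rules: {state_id: rule names with items in that state}
       rule_ignore: {rule name: set of ignored terminal names}"""
    return {
        state_id: set().union(*(rule_ignore.get(rule, set()) for rule in rules))
        for state_id, rules in state_rules.items()
    }

state_rules = {0: {"start", "string"}, 1: {"string"}}
rule_ignore = {"start": {"WS", "COMMENT"}, "string": set()}
print(ignore_sets_per_state(state_rules, rule_ignore))
# -> {0: {'WS', 'COMMENT'}, 1: set()} (set order may vary)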

Regarding the syntax, I can't think of a better idea. How can you have a grammar within a grammar, without enclosing them in an explicit scope?

Currently always_accept is used by the indenter as a substitute for changing the "ignore" set at run-time. I'm not sure if it's useful for anything else.

erezsh avatar Jul 23 '20 12:07 erezsh

Yeah, that thing with ignore sets per rule was basically what I was going for. Guess I'll have to look into exactly how rules and states correspond.

Yeah, I guess that syntax is basically as good as it can get. It's not the prettiest, but it's readable enough. One complaint is that it makes terminals look like they're scoped as well, which I don't think is the intention.

I don't really care exactly what always_accept does, but I think we both agree that having some way to parse indentation is essential.

I'll see if I can draft up a pull request in a few days.

Kyuuhachi avatar Jul 23 '20 16:07 Kyuuhachi

Great, looking forward to it! Let me know if you have questions regarding the code.

erezsh avatar Jul 23 '20 16:07 erezsh

Currently trying to add the new forms to the meta-grammar, but adding

    'statement': ['ignore', 'import', 'declare', 'ignore_scoped'],
    'ignore_scoped': ['_IGNORE ignore_scoped_list _LBRACE _list _RBRACE _NL' ],
    'ignore_scoped_list': ['_ignore_scoped_list'],
    '_ignore_scoped_list': ['ignore_scoped_item',
                            '_ignore_scoped_list _COMMA ignore_scoped_item'],
    'ignore_scoped_item': ['atom', 'MINUS atom'],

makes non-scoped ignores unable to match, and I can't figure out why.

Kyuuhachi avatar Jul 23 '20 22:07 Kyuuhachi

Also another alternative syntax is

%ignore /\s+/
%scoped {
    %unignore /\s+/ // Not sure if anonymous regexes are deduplicated, so this might not work
    %ignore / +/
}

This one gives %ignore inside scopes a natural meaning, and it extends more easily to other scoping extensions if any are added in the future. (It also neatly sidesteps the issue in the previous comment.)

Kyuuhachi avatar Jul 23 '20 22:07 Kyuuhachi

You can write it with %ignore_scoped, to avoid collisions, and I'll find a way to merge them into one command later on.

erezsh avatar Jul 24 '20 08:07 erezsh

I think I'll go with %scoped plus %unignore, if that's okay with you. It seems more consistent with the existing grammar (the use of minus feels a bit strange, and the comma separation differs from how %ignore is normally used), and easier to implement.

Also, I'm not sure how scoped ignores should interact with %import. From what I understand, %ignore statements in imported files are currently discarded, and the imported rules use the ignore rules of the importing file instead. Retaining that behavior wouldn't be overly difficult (but it would then have to discard scopes as well, which is not very intuitive). Alternatively, I could make imported rules keep the ignores from the file/scope they're defined in (without inheriting anything from the importing scope), but this might break a few existing grammars, and would make importing rules quite a bit more limited, since rules would no longer adapt to fit their host.

Kyuuhachi avatar Jul 24 '20 11:07 Kyuuhachi

I think it makes sense that importing a rule will import its ignore-settings. That also makes sense in the code, if you store them in RuleOptions.

I can't imagine there are many situations in which you would want to re-use the same rule with different ignore-sets.

Regarding backwards-compatibility, just add a flag for it like import_ignore_settings=False. Lark is nearing its 1.0 release, so if it works well it could become the default behavior soon.

erezsh avatar Jul 24 '20 12:07 erezsh

One place where such a thing would make sense would be with templates, such as in a combinator library:

_sep{_s, v}: (v _s)* v
_sep1{_s, v}: (v _s)+ v

Then you would normally want the host's whitespace rules rather than the guest's. That's a rather artificial example though; I don't know if anyone would actually write it like that in practice.
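
For instance, instantiated directly in a host grammar (everything is in one file here for brevity; in practice _sep would live in an imported combinator library, which is where the inherit-or-not question comes up):

from lark import Lark

# The host grammar supplies the separator; with "inherit ignores from the host"
# semantics, the template body would also use the host's %ignore set.
grammar = r'''
    start: "[" _sep{",", item} "]"
    item: /[a-z]+/

    _sep{_s, v}: (v _s)* v

    %ignore " "
'''

parser = Lark(grammar)
print(parser.parse("[a, b, c]").pretty())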

(I haven't gotten quite far enough to test, but I think templates will automatically use lexical scoping without any extra effort on my part.)

Having a compatibility flag sounds like a decent solution.

Kyuuhachi avatar Jul 24 '20 13:07 Kyuuhachi

That's a good point. It might be nice to add some support for that afterwards, like specifying "inherit-ignore" somehow. But I think that should be the last thing to worry about.

erezsh avatar Jul 24 '20 13:07 erezsh

Adapting the parsers is proving quite tricky... For the LALR parser, I can't figure out how to find the ruleset corresponding to each integer state id, and for xearley, I don't even know how to start.

Anyway, I'll take some time off for now. I drafted a pull request (#631), though it doesn't contain anything all that impressive yet.

Kyuuhachi avatar Jul 24 '20 18:07 Kyuuhachi

For LALR, I think that rather than messing with the lexer, it's better to add a third action, besides shift and reduce, which changes the state without pushing anything new onto the stack. As a side effect, this would send ignored tokens through the postlexer, which I think renders always_accept obsolete.
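
Very roughly, what I have in mind (pseudo-code, not Lark's actual parser loop; the table layout and the on_ignored hook are made up):

# A hypothetical "ignore" action alongside shift/reduce: it consumes the
# ignored token and changes the current state, but pushes nothing onto the
# stacks, and the token stays visible to whatever postprocessing wants it.
def step(action_table, state_stack, value_stack, token, on_ignored):
    action, arg = action_table[state_stack[-1]][token.type]
    if action == "shift":
        state_stack.append(arg)
        value_stack.append(token)
    elif action == "ignore":
        state_stack[-1] = arg   # change state without growing the stacks
        on_ignored(token)       # e.g. hand it to an indentation handler
    else:
        # "reduce" would work exactly as in a normal LALR loop
        raise NotImplementedError(action)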

Inserting those rules into the table was not difficult; however, I'm having a hard time inserting reduce rules when the lookahead is an ignored token. More concretely, I can't figure out what nonterminal_transitions, reads, directly_reads, lookback, and includes mean, and I can't find much by googling either. Any guidance would be gratefully accepted.

Is it better to discuss such things here or on the PR?

EDIT: Thinking again, that might cause something akin to a shift/reduce conflict in grammars such as

%ignore " "
a: b (WS) "x"
b: "a" (WS) "b"
 | "a" // (WS) represents where ignore tokens are accepted

since {<b : "a" • (WS) "b">, <b : "a" • >} could semi-shift to {<b : "a" • (WS) "b">} or reduce to <a : b • (WS) "x">. So that's not gonna work. But ignoring the tokens in the lexer would make the grammar

start: a | b
%scoped {
  %ignore "1"
  a: "x" "y"
}
%scoped {
  %ignore "2"
  b: "x" "z"
}

accept the strings x2y and x1z, since after shifting "x" the state would contain items from both a and b, so the union of their ignore sets would apply. So I'm kinda starting to doubt whether this is possible in LALR. I haven't investigated Earley much yet, but I don't think it supports a postlexer.

Kyuuhachi avatar Jul 26 '20 23:07 Kyuuhachi

I think those conflicts could be avoided if, instead of allowing ignored tokens between any two rules, we allow them after any terminal. That way, ignored tokens would always shift. However, that would behave incorrectly if a rule ends on a nonterminal from another scope (taking NULLABLE into account), which would need to be rewritten from

%scoped {
    %unignore WS
    string : "\"" strchar* "\""
}

to

string: _string "\""
%scoped {
    %unignore WS
    _string : "\"" strchar*
}

It might be possible to define some rewrite rule that does this automatically, but I think that could in some cases essentially mean copying the entire grammar several times over, if there are enough nullable rules and scopes. That probably won't happen in real grammars, so it may or may not be worth worrying about.

Kyuuhachi avatar Jul 30 '20 12:07 Kyuuhachi