pest icon indicating copy to clipboard operation
pest copied to clipboard

RFC: Make white-space handling less confusing / more consistent with the introduction of an "adjacent selector": `,`

Open Kroc opened this issue 5 years ago • 34 comments

Summary

Make white-space handling less confusing / more consistent with the introduction of an "adjacent selector": ,

Motivation

  • The rules soi, eoi, white-space & comment look like any other rule, and with whitespace and comment being names that users are extremely likely to use themselves, the hidden behaviour can be confusing, unexpected and even impractical for the user's desire

  • The implicit behaviour of the ~ selector only when a whitespace rule is defined, is sometimes not what the user wants and they have to resort to the @ modifier even though this effects the entire rule and not just one selector. This can make it very difficult to achieve certain effects without resorting to nested rules

Guide-level explanation

Adjacent elements in a rule can be separated by a ",":

rule = { a, b, c }

Where rule a is followed by rule b which is followed by rule c. No white-space is assumed between the rules, though note that any rule can itself contain a whitespace rule.

Contrast this with:

rule = { a ~ b ~ c }

Which will check for optional white-space and comments between rules a, b & c. The adjacent (,) and white-space selector (~) can be combined in a rule; this can be used to carefully control the automatic assumption of white-space or comments.

rule = { a, b ~ c }

The behaviour of the adjacent selector remains consistent within nested rules making use of modifiers. In the below example, the behaviour of rule a is unaffected by the parent rules utilising different white-space moderators.

rule1 = @{ a }
rule2 = ${ a }
rule3 = !{ c }

a = { b, c }

Reference-level explanation

I can't comment on the inner workings of Pest, as I haven't even completed my first ever Rust program yet, but based on writing parsers in other languages (VB6, Go, Perl6), the adjacent selector should provide minimum difficulty. It doesn't add functionality, and expresses a natural state of one token following another.

Drawbacks

  • Even with the addition of such a simple feature there is still the cost of implementation, testing, documentation and compatibility

  • There could be unforeseen consequences with parsing behaviour with complex combinations of features; adding another feature increases the available complexity

Rationale and alternatives

This is the simplest possible design to resolve a need of adjacent selection without changing the behaviour of existing selectors (for compatibility). The choice of actual character used (,), the name, the terminology etc, can be debated.

In some languages (e.g. PHP, some functional languages) a dot is used for concatenation:

alternative = { a . b . c }

This may be desirable over a comma as it is more explicitly a "operator" between words, where as a comma could be confused for something more general and peppered by the user in places where it shouldn't be.

Adjacency could be communicated without the use of a character where one rule separated from another rule by white-space implies adjacency; e.g.:

alternate = { a b c }

Whilst this form exists in BNF and some derivatives, the use of an explicit separator avoids potential parse-errors or unintentional behaviour from the user and also provides visual balance as every rule is always separated in all cases, regardless of separator.

Alternative options include changing the functionality of existing features, such as how the "~" selector or "@" modifier operates.

The automagic behaviour of rules with specific names is a documentation / support / learning hindrance, but I believe that that can be resolved separately outside of this RFC as they will have much more drastic implications.

The proposed feature adds to the project, provides benefits, whilst also not taking away from existing features. At the cost of additional documentation, it may help users avoid issues starting out.

Prior art

The use of a comma for adjacency is present in EBNF, which Pest roughly follows.

Unresolved questions

?

Kroc avatar Sep 05 '18 20:09 Kroc

Added . alternative. In some ways I actually prefer this over the comma, but additional symbols may want to be reserved for future functionality.

Kroc avatar Sep 06 '18 11:09 Kroc

👍

The context-dependent meaning of ~ is troublesome. In some cases, it can even mean different things in a single rule:

double_meaning = { "a" ~ "b" } // when used directly, matches whitespace
indirect = @{ double_meaning } // when used indirectly, doesn't match whitespace

whitespace = { " " }

As an extension, I would consider changing ~ to always mean "match whitespace", and introducing a second operator to always mean "don't match whitespace". Then atomic modifiers would relate only to which rules appear in the output.

wirelyre avatar Sep 06 '18 20:09 wirelyre

Unfortunately, this doesn't take into consideration the other combinators like + or * which would also need a way to deal with the WHITESPACE issue. A solution would need to be proposed for these as well.

Apart from the drawbacks already mentioned, there's also the fact that there is some duplication of functionality when WHITESPACE is not defined, i.e. ~ and , do the same thing. How would one explain this? Of course, it might make sense to have ~ fail if WHITESPACE is not defined.

For the particular case of ~, maybe it makes to have it defined as a rule ~ = { " "* }?

Another alternative we could take into consideration is to separate non-whitespace from the other modifiers and have it work inside of the rules.

rule = { @(a ~ b*) ~ c }

Above, only the last ~ will accept WHITESPACE. It also probably makes sense to keep the cascading effect as well so that one doesn't have to write @ too often.

dragostis avatar Sep 08 '18 04:09 dragostis

For the purposes of this RFC, and to avoid going off on a bike-shedding tournament, we should avoid discussion of how best to change / replace the implicit whitespace rule / ~ here; that belongs in its own RFC, and I will probably do that in about a weeks time unless anybody else wants to go ahead and do that.

This RFC is for the discussion of an "adjacent selector" (I don't know if "selector", "operator" or "combinator" is the correct term for these things, it's never stated in the documentation). If we get into discussing alternate strategies for the whitespace rule, we'll literally never get this simple feature request in.

As Pest currently stands for this RFC, the ~ will operate differently depending on the rule modifier of that rule or any parent rule. This is still useful for some purposes. Again, alternate designs for ~ / whitespace should be discussed on a different RFC.

(sorry to be nagging, but I just want a simple adjacent operator as soon as possible, and worry about the whitespace weirdness after that)

Kroc avatar Sep 08 '18 07:09 Kroc

I agree that this could lead to a really long and possibly not as productive discussion, but the issue here is that adding an "adjacent selector" would only add complexity to the grammar without removing whitespace-dependent behaviour of atomic rules, since ~ operators would still function the same complicated way.

While I do agree that the current design may not be the most intuitive, I think that the best way forward would be to only change the grammar when absolutely needed. Maybe a good idea would be to implement this as a prototype, or simply argue a bit more about the case where adding this features plays well with all the other features and is a good addition by itself.

And don't get discouraged! The issue your brought up very much exists and I'm excited to find a solution together.

dragostis avatar Sep 08 '18 16:09 dragostis

I have to admit that I'm very excited about this RFC, because I think the discussion could potentially make pest syntax more consistent and easier to learn.

In that spirit: another option would be to introduce a modifier to the operators themselves to indicate that they take optional whitespace. Something like a ~& b and ( expr )+&.

That looks a bit ridiculous, now that I see it written out. Clearly I've been using too much J.

wirelyre avatar Sep 09 '18 01:09 wirelyre

@wirelyre, I'm really excited about this too. I feel like this and pest_deconstruct will make pest an amazing parser. The issue here is that it's hard to find a syntax that is simple enough.

dragostis avatar Sep 09 '18 03:09 dragostis

This is sort of co-opting the RFC, but bringing this up in Gitter led to a discussion around Pest's whitespace handling and I came up with a proposal overhaul of how Pest handles whitespace that streamlines behavior.

To summarize:

  • a b is the strict sequence of a and b
  • a | b is the ordered choice of a then b
  • a ~ b is the whitespace/comment-tolerant sequence of a and b
  • &a is the positive predicate
  • !a is the negative predicate
  • a? is the repetition of a zero-or-one times
  • a* is the whitespace/comment-tolerant repetition of a zero-or-more times
  • a+ is the whitespace/comment-tolerant repetition of a one-or-more times
  • $(a*) is the strict repetition of a zero-or-more times
  • $(a+) is the strict repetition of a one-or-more times
  • _(a) omits a but produces a's children
  • @(a) produces a but omits a's children (transitively)

Those twelve rules are enough to define all of the matching behavior (excluding discussion of builtins). You can @ me on Gitter if you want to discuss this idea further, but I'm documenting it here.

I think that draft is both actionable and covers all of the expressiveness desired. (Interestingly, ~ behaves essentially as a normal sequenced rule here.) The only issue I really see rising up might be associativity definitions. Instead of using lack of a separator for sequence as in "pure" PEG, we could also use any of ,.:#%^- and retain the simplicity of this model, just with an explicit "strict sequence" operation and "loose sequence" operation instead of adjacency sequencing and ~ just being another adjacent production.

This is just speculative by me at 3 AM, though, so take it just as that -- an idea put onto the floor for discussion.

CAD97 avatar Sep 14 '18 07:09 CAD97

I can't find the discussion, but I'd like to record what I do remember from discussion here:

Consensus was that having a "pure adjacency" available is very useful, and should definitely be added. - was suggested and no better alternatives were proposed.

, can't be used, as parameterized rules want to use , to separate parameters. (They could use ; instead, but having , allowed but not separate parameters in this position would be confusing.)

There are no specific plans on using . that would preclude its use for this purpose, but - is beneficial in that it appears more "binary operator"-like, and a visual similarity to ~ suggests a similar purpose. This also brings in a connotation that ~ is - but "roundabout" or "loose".

a - b ~ c would then expand to the functional equivalent of a - b - (WHITESPACE|COMMENT)* - c.

In order to allow repetitions to be strict, I believe the winning proposal was to treat it somewhat like how the repetitions are handled in Rust macros: rule-* would be strict repetition, rule~* would be loose repetition, and rule* would continue to be loose repetition.


Changes not part of this RFC's motivated changes, but on the table for consideration for a 3.0:

  • Changing @ away from the current definition and instead just suppressing child rule emission
  • Changing ~'s behavior dependence on @
  • Linting use of ~ when neither WHITESPACE or COMMENT are defined
  • Removing ~'s implicit behavior based on WHITESPACE and COMMENT, instead defining the trivia by assigning to it directly
  • Changing WHITESPACE and COMMENT's intrinsic atomicity

A very vague outline of how to go about adding this - "strict adjacency" operator for 2.1:

  • Adjust the grammar to allow for strict versus loose adjacency and repetitions
  • Adjust the AST to allow for strict versus loose adjacency and repetitions
  • In the processing passes done on the grammar, loose adjacency/repetition transforms into strict adjacency/repetition with whitespace/comment rules added as required

CAD97 avatar Oct 14 '18 01:10 CAD97

I very much like introducing these rules as "experimental" for 2.1 with possibly an opt-in flag. That should make the transition to 3.0 much leaner.

dragostis avatar Oct 14 '18 02:10 dragostis

+1 from me. I'd also like to suggest an operator that says whitespace is required between two rules. I can't do this in normal mode with whitespace defined and adding ... ~ whitespace ~ ..., because it will be consumed by the preceding implicit rule first.

(~~Pest's "greedy" parsing is quite confusing to me as a new user. I didn't see anything about it in the documentation and I initially expected some kind of non-determinism that could figure out conditions like this. Ran into it in another place where I listed terminals first in a union of all expressions and it only matched the first term of the expression, but all of that is a topic for another issue.~~ EDIT: Found #337 , all is well)

Back to the whitespace topic: There is no option for mandatory whitespace AFAIK except to use atomics, manually add whitespace+ in between, and completely contain the atomic's effects in the current rule by encapsulating sub-rules in !{ }. Needless to say, that's pretty long for something so simple.

agausmann avatar May 03 '19 08:05 agausmann

The current plan is to provide - for simple adjacency, ~ for optional trivia, and ^ for mandatory trivia.

That said, what use cases do you have for mandatory trivia? I've personally found none (other than "longest match" lexing) that are actually necessary, due to traditional lexers stripping out trivia before passing along to the parser stage.

The "longest match" lexing is easy to think that it should be achieved via mandatory trivia, e.g. "if" ^ expr. But actually, this prevents if(⟨expr⟩), which would be valid with a traditional lexer. The proper solution here is to have a keyword production that asserts with lookahead that it's not followed with a character that could continue the identifier, e.g. ident = @{ id_start - id_cont-* } if = @{ "if" - !id_cont }.

CAD97 avatar May 03 '19 16:05 CAD97

@CAD97 There are languages where whitespace matters. The best practical example I can give is Haskell, where whitespace is a token used as the function application operator - foo x is roughly equivalent to Rust's foo(x). I have a language that has this same syntax. Currently I just use optional trivia and it works fine, but it lets you write something like foo[1, 2]{3: 'bar'}, which should technically be foo [1, 2] {3: 'bar'}.

Later update: Whitespace between arguments in Haskell is apparently not required, I was mistaken.

agausmann avatar May 03 '19 18:05 agausmann

The difference is that when whitespace is semantic like that, I don't see it as being trivia. I admit that greater control over the trivia is a great option to be had, but for lack of a better way of putting it, it seems like an abuse of trivia to use it semantically (beyond not allowing it in terminals, even when they're built from pieces).

For example, does Haskell allow foo{- block comment -}x? I didn't think so, but apparently it does. If you have comments as trivia and you just require trivia between items for function application, then it will allow this.

CAD97 avatar May 03 '19 18:05 CAD97

Sure, that absolutely makes sense. I think I'd personally be fine with defining something like wsp = @{ (WHITESPACE | COMMENT)+ } and using it explicitly with the new adjacent operator. Still possible now with atomics as I've described before, just weirder.

The implicit whitespace matching by default is a nice feature but it can easily go unnoticed and quietly cause trouble if you're not careful about disabling it where it's not wanted. The proposals in #333 are a good step toward making that distinction clearer.

agausmann avatar May 04 '19 01:05 agausmann

The current plan is to provide - for simple adjacency, ~ for optional trivia, and ^ for mandatory trivia.

This is great! Is there a time line for the feature?

My use case is requiring a new line at a certain point in a grammar. At all other points, new lines are optional (and therefore white space).

ejoebstl avatar May 16 '19 13:05 ejoebstl

@ejoebstl, I have the parser working, but there's more to it:

  • [x] parser
  • [ ] validator
  • [ ] optimizer
  • [x] generator
  • [ ] vm

I have been quite busy lately since I moved to a new country and I'm currently getting settled. Hopefully I'll get some stuff done this weekend.

dragostis avatar May 16 '19 18:05 dragostis

@dragostis Is there any way I could help with the current implementation? Maybe you could put the current state into a feature branch and I'd try to tackle one or more of the missing tasks?

ejoebstl avatar Jun 15 '19 20:06 ejoebstl

@ejoebstl, oh, yes! I'll upload my progress here tomorrow. I have the whole parser written and would love some help.

dragostis avatar Jun 15 '19 21:06 dragostis

@ejoebstl, here's the repo: https://github.com/pest-parser/pest3 For now, work can live there. The next logical step would be to port the validator to the new version of meta.

dragostis avatar Jun 16 '19 18:06 dragostis

#118 is closed but I do not see it here. Unless there's some workaround I would like to have silent atomic rules, like _@{} or @_{}.

Currently is there at least some way to suppress atomic rule?

PetrGlad avatar Aug 19 '19 12:08 PetrGlad

Also I find it confusing as ! meaning depends on context - either negation or atomicity cancelling. Without knowing about modifiers one might read !{.....} as "something that does not match .....".

PetrGlad avatar Aug 19 '19 12:08 PetrGlad

Is there another way to do this? introduce a new built-in keyword BOUNDARY which is similar with word boundary in regex which takes zero-length input and matches

  1. previous token is normal token while next token is WHITESPACE or COMMENT
  2. previous token is WHITESPACE or COMMENT while next token is normal token

clouds56 avatar Nov 26 '20 03:11 clouds56

I have the same use case @ejoebstl has - for HCL files: most of the file can just treat WHITESPACE as WHITESPACE, but within an object declaration, objectelem can be separated by either WHITESPACE* ~ "," ~ WHITESPACE*, or by (!newline ~ WHITESPACE)* ~ newline ~ WHITESPACE*; and that latter case is awkward to introduce today without a needless intermediary node in the parse tree due to the lack of silent-atomics (#520 , and previously #118 ).

rbtcollins avatar Jul 20 '21 08:07 rbtcollins

Would it possible to introduce a smaller change-set that would not require v3? I feel like this is an important issue for the community that has been a bit neglected over the years and pushing it to v3 basically makes it so it won't be in anytime soon.

Sytten avatar Aug 22 '22 15:08 Sytten

@Sytten In theory, yes, I'm happy to accept PRs for it as long as it's a non-breaking change to the current grammar. The discussion in this issue was calling for a bigger overhaul of the grammar which would naturally be breaking. What would a smaller change set look like (something for #118 #520)?

tomtau avatar Aug 22 '22 22:08 tomtau

I feel the base use case is that you can't select between WHITESPACE* and WHITESPACE+. So maybe a simpler change would be to allow modifiers on the ~ where the default would be * for backward compatibility. Maybe not a good idea, but just trying to throw ideas out there so we can move forward with a solution.

Sytten avatar Aug 22 '22 23:08 Sytten

@CAD97 any thoughts? The original proposal appeared to be about the extra adjacency selector/operator, but it was co-opted into a discussion about the grammar overhaul. While the grammar overhaul looked to be heading in a good direction, it'd take time. The main argument against the original proposed grammar addition was the increased grammar complexity (unforeseen consequences with parsing behaviour with complex combinations of features; duplication of ~ if no WHITESPACE is defined...). Maybe one way to move this forward would be to have an "experimental" feature-guarded extended meta grammar ?

tomtau avatar Aug 23 '22 14:08 tomtau

The main pitfall with just introducing a new direct adjacency operator is that repetitions also consume WHITESPACE.

In current pest, whether WHITESPACE is consumed is a contextual question of being atomic. This quality is inherited.

So I think what would fit most would be introducing - (or , or whatever) as a no-whitespace adjacency operator, and allowing writing @{ rules } inline to add the atomic context inline. These can be done separately.

CAD97 avatar Aug 23 '22 19:08 CAD97

The main pitfall with just introducing a new direct adjacency operator is that repetitions also consume WHITESPACE.

In current pest, whether WHITESPACE is consumed is a contextual question of being atomic. This quality is inherited.

So I think what would fit most would be introducing - (or whatever) as a no-whitespace adjacency operator, and allowing writing @{ rules } inline to add the atomic context inline (and the same for $ and !). These can be done separately.

CAD97 avatar Aug 23 '22 19:08 CAD97