
Grammar tutorial should clarify what "ignore whitespace" means

Open flwyd opened this issue 3 years ago • 2 comments

Problem or new feature

https://docs.raku.org/language/grammar_tutorial describes regex, token, and rule as follows:

  • Regex methods are slow but thorough, they will look back in the string and really try.
  • Token methods are faster than regex methods and ignore whitespace. Token methods don't backtrack; they give up after the first possible match.
  • Rule methods are the same as token methods except whitespace is not ignored.

It's not clear to me what "Token methods... ignore whitespace" means, and in what way tokens ignore whitespace while rules do not. For example,

'foo' ~~ token { 'f' 'o'+ }  # OUTPUT: ｢foo｣
'f oo' ~~ token { 'f' 'o'+ }  # OUTPUT: Nil
'foo' ~~ rule { 'f' 'o'+ }  # Nil
'f oo' ~~ rule { 'f' 'o'+ }  # OUTPUT: ｢f oo｣
'fo o' ~~ rule { 'f' 'o'+ }  # Nil
'f   o o' ~~ rule { 'f' 'o'+ }  # ｢f   o ｣

This suggests that tokens do not ignore whitespace; any whitespace between the component parts of a token prevents the token from matching. And in this example, whitespace seems to be mandatory, in that 'foo' doesn't match the rule "f followed by one or more o". (I think the latter is because the default ws token is <!ww> \s* which doesn't match between f and o.)
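
The guess about the default ws can be checked by writing the implicit calls out by hand. This is only a sketch, assuming a rule turns the whitespace after each atom into a `<.ws>` call; `$as-rule`, `$by-hand` and `matched` are illustrative names, not anything from the documentation.

```raku
# Assumption: rule { 'f' 'o'+ } behaves like a token with <.ws> inserted
# wherever whitespace follows an atom in the body.
my $as-rule = rule  { 'f' 'o'+ };
my $by-hand = token { 'f' <.ws> 'o'+ <.ws> };

# Helper (hypothetical): the matched text, or Nil when there is no match.
sub matched($m) { $m ?? $m.Str !! Nil }

for 'foo', 'f oo', 'fo o', 'f   o o' -> $input {
    say $input.raku, ' => ',
        matched($input ~~ $as-rule).raku, ' vs ',
        matched($input ~~ $by-hand).raku;
}
```

If the assumption holds, the two results printed on each line should be identical.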

Suggestions

The "ignore" verb in this context is ambiguous: I think the documentation is saying that whitespace inside the body of the token definition doesn't have any effect. But the first few times I read that section, I thought it was saying that token methods will ignore whitespace inside the string being matched.

One way to clarify this would be something like:

  • Regex methods are slow and ... really try. Whitespace inside the regex method body has no effect.
  • Token methods are faster than regex methods. Token methods ... match. Whitespace inside the token method body has no effect.
  • Rule methods behave like token methods, except each run of whitespace in the body matches a word boundary and any amount of whitespace.
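
To illustrate the "has no effect" wording for tokens, here is a small sketch comparing two spellings of the same token; the variable names are made up for the example.

```raku
# The same pattern written with and without whitespace in the body.
my $spaced  = token { f   o+ };
my $compact = token {fo+};

for 'foo', 'f oo' -> $input {
    # Whitespace in the token body is not part of the pattern, so both
    # spellings should give the same answer for every input.
    say $input.raku, ': spaced=', so($input ~~ $spaced),
                     ' compact=', so($input ~~ $compact);
}
```

Neither spelling lets the token skip over whitespace in the input, which is the other reading of "ignore" that the current text invites.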

flwyd, Nov 21 '21 01:11

Well, on the surface of it, tokens do ignore whitespace in the expression of the token or rule, as is quite clear from the first two examples, where it's a no-op. However, rules do not, as is equally clear from the other examples. Effectively, it looks like a space in a rule matches any amount of whitespace, and the documentation doesn't say that. Your suggestions seem quite reasonable; I encourage you to create a PR to incorporate them.

JJ, Nov 21 '21 08:11

@flwyd

Yes.

"Rule methods behave like token methods, except each run of whitespace in the body ..."

It's not every run of whitespace in the body. See https://stackoverflow.com/questions/48892306/when-is-white-space-really-important-in-perl6-grammars


"... matches a word boundary and any amount of whitespace."

It's not (necessarily) (just) a word boundary. Instead it's:

  • At a conceptual / abstract level it's just a "tokenizing boundary" with no notion of "word" or "whitespace", where "tokenizing" is about the input string being matched, with no necessary correspondence to a token.

  • That said, the default is as you describe. Concretely, in Rakudo, it's token ws declared in Grammar.nqp.
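
To make the "tokenizing boundary" idea concrete, here is a hedged sketch of a grammar that supplies its own ws. The grammar names and the comma-based ws are invented for illustration; they are not taken from Rakudo or the docs.

```raku
# Uses the inherited, default ws.
grammar DefaultWs {
    rule TOP { 'f' 'o'+ }
}

# Same rule, but ws is redefined (hypothetically) so that commas,
# not spaces, are what may separate the elements.
grammar CommaWs {
    rule  TOP { 'f' 'o'+ }
    token ws  { ','* }
}

say so DefaultWs.parse('f oo');   # True
say so DefaultWs.parse('foo');    # False
say so CommaWs.parse('f,,oo');    # True
say so CommaWs.parse('foo');      # True
say so CommaWs.parse('f oo');     # False
```

The rule body is identical in both grammars; only the meaning of the whitespace inside it changes, which is why describing it purely in terms of "whitespace" or "word boundary" covers only the default case.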


The following may be too complicated, but is hopefully at least good food for thought:

rules behave like tokens, except whitespace between elements of the rule's pattern requires a corresponding "break" in the input. By default the break needs to be whitespace or a switch between "word" and non "word" characters. For example, the whitespace between foo and bar in an input string foo bar would be a matching break, and so would the (zero width) character class shift between $ and 100 in $100.
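
The two breaks mentioned in that wording can be tried directly. A minimal sketch, using the default ws; `$price` and `$words` are names made up for the example.

```raku
# Whitespace after '$' in the rule body asks for a break between '$' and the digits.
my $price = rule { '$' \d+ };
say so '$100'  ~~ $price;    # True:  zero-width break between non-word '$' and '1'
say so '$ 100' ~~ $price;    # True:  the break is actual whitespace

# Whitespace between two word-like elements.
my $words = rule { 'foo' 'bar' };
say so 'foo bar' ~~ $words;  # True:  whitespace separates the words
say so 'foobar'  ~~ $words;  # False: no break between 'o' and 'b'
```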

raiph, Nov 21 '21 23:11