Grammar tutorial should clarify what "ignore whitespace" means
Problem or new feature
https://docs.raku.org/language/grammar_tutorial describes `regex`, `token`, and `rule` as follows:
- Regex methods are slow but thorough, they will look back in the string and really try.
- Token methods are faster than regex methods and ignore whitespace. Token methods don't backtrack; they give up after the first possible match.
- Rule methods are the same as token methods except whitespace is not ignored.
It's not clear to me what "Token methods... ignore whitespace" means, and in what way tokens ignore whitespace while rules do not. For example,
'foo' ~~ token { 'f' 'o'+ } # OUTPUT: ｢foo｣
'f oo' ~~ token { 'f' 'o'+ } # OUTPUT: Nil
'foo' ~~ rule { 'f' 'o'+ } # Nil
'f oo' ~~ rule { 'f' 'o'+ } # OUTPUT: ｢f oo｣
'foo' ~~ rule { 'f' 'o'+ } # Nil
'fo o' ~~ rule { 'f' 'o'+ } # Nil
'f o o' ~~ rule { 'f' 'o'+ } # ｢f o ｣
This suggests that tokens do not ignore whitespace; any whitespace between the component parts of a token prevents the token from matching. And in this example, whitespace seems to be mandatory, in that `'foo'` doesn't match the rule "f followed by one or more o". (I think the latter is because the default `ws` token is `<!ww> \s*`, which doesn't match between `f` and `o`.)
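One way to check that reading is to spell out the implicit whitespace calls by hand. This is only a sketch: it assumes that whitespace after an atom in a `rule` turns into a `<.ws>` call and that the default `ws` behaves like `<!ww> \s*`. The variable names are made up for illustration.

```raku
# Assumption: rule { 'f' 'o'+ } is shorthand for a token with <.ws> after each atom.
my $as-rule  = rule  { 'f' 'o'+ };
my $as-token = token { 'f' <.ws> 'o'+ <.ws> };

for 'foo', 'f oo', 'fo o', 'f o o' -> $input {
    # .gist renders a Match as ｢...｣ and a failed match as Nil
    my $from-rule  = ($input ~~ $as-rule).gist;
    my $from-token = ($input ~~ $as-token).gist;
    say "[{$input}] rule: {$from-rule} | token with explicit ws: {$from-token}";
}
```

If the assumption holds, the two columns agree for every input, and `'foo'` fails in both because `<!ww>` refuses to match between `f` and `o`.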
Suggestions
The "ignore" verb in this context is ambiguous: I think the documentation is saying that whitespace inside the body of the `token` definition doesn't have any effect. But the first few times I read that section, I thought it was saying that `token` methods will ignore whitespace inside the string being matched.
One way to clarify this would be something like:
- Regex methods are slow and ... really try. Whitespace inside the regex method body has no effect.
- Token methods are faster than regex methods. Token methods ... match. Whitespace inside the token method body has no effect.
- Rule methods behave like token methods, except each run of whitespace in the body matches a word boundary and any amount of whitespace.
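A small comparison along the lines of the proposed wording, as a sketch rather than documentation text. The labels and the `@patterns` array are hypothetical. The same pattern is written with and without whitespace in the body for each declarator.

```raku
# Each pair is a display label and an anonymous pattern.
my @patterns =
    'token, spaced body'  => token { f o+ },
    'token, compact body' => token {fo+},
    'regex, spaced body'  => regex { f o+ },
    'rule, spaced body'   => rule  { f o+ },
    'rule, compact body'  => rule  {fo+};

for @patterns -> $entry {
    my $pattern = $entry.value;
    for 'foo', 'f oo' -> $input {
        say "{$entry.key} on [{$input}]: {($input ~~ $pattern).gist}";
    }
}
```

If the suggested wording is accurate, only the spaced `rule` should behave differently from the rest: it should reject `foo` and accept `f oo`, while every other variant does the opposite.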
Well, on the surface of it, tokens do ignore whitespace in the expression of the token, as is quite clear from the first two examples, where it's a no-op. However, `rule`s do not, as is equally clear from the other examples. Effectively, it looks like a space matches any amount of whitespace, and that's not stated there. Your suggestions seem quite reasonable; I encourage you to create a PR to incorporate them.
@flwyd
Yes.
> Rule methods behave like token methods, except each run of whitespace in the body
It's not every run of whitespace in the body. See https://stackoverflow.com/questions/48892306/when-is-white-space-really-important-in-perl6-grammars
> matches a word boundary and any amount of whitespace.
It's not (necessarily) (just) a word boundary. Instead it's:
- At a conceptual / abstract level it's just a "tokenizing boundary" with no notion of "word" or "whitespace", where "tokenizing" is about the input string being matched, with no necessary correspondence to a `token`.
- That said, the default is as you describe. Concretely, in Rakudo, it's `token ws` declared in `Grammar.nqp`.
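To make the "tokenizing boundary" idea concrete, here is a minimal sketch in which a grammar overrides `ws`, so the boundary has nothing to do with whitespace. Both grammar names are hypothetical, and the dash-based `ws` is an arbitrary choice for illustration.

```raku
# Uses the inherited default ws.
grammar DefaultBreak {
    rule TOP { 'a' 'b' }
}

# Overrides ws: a "break" is now any run of dashes, possibly empty.
grammar DashBreak {
    rule TOP { 'a' 'b' }
    token ws { '-'* }
}

for 'a b', 'a--b' -> $input {
    my $default = DefaultBreak.parse($input).gist;
    my $dashes  = DashBreak.parse($input).gist;
    say "[{$input}] default ws: {$default} | dash ws: {$dashes}";
}
```

The body of `TOP` is identical in both grammars; only the definition of `ws` decides which input parses.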
The following may be too complicated, but is hopefully at least good food for thought:
`rule`s behave like `token`s, except whitespace between elements of the `rule`'s pattern requires a corresponding "break" in the input. By default the break needs to be whitespace or a switch between "word" and non "word" characters. For example, the whitespace between `foo` and `bar` in an input string `foo bar` would be a matching break, and so would the (zero width) character class shift between `$` and `100` in `$100`.
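The two kinds of break described here can be tried out with anonymous rules. This is a sketch that assumes the default `ws`; the variable names are invented for the example.

```raku
my $words  = rule { 'foo' 'bar' };
my $amount = rule { '$' \d+ };

# Whitespace in the input is a break; two adjacent word characters are not.
for 'foo bar', 'foobar' -> $input {
    say "[{$input}] words: {($input ~~ $words).gist}";
}

# A shift from a non-word character to a word character is a zero-width break;
# actual whitespace at the same spot is accepted as well.
for '$100', '$ 100' -> $input {
    say "[{$input}] amount: {($input ~~ $amount).gist}";
}
```

Under that assumption, `foobar` is the only input of the four that should fail, because there is no break between `o` and `b`.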