TatSu
Support repetition qualifiers for closures
Could you support:
rule = {expression}{7} ;
or
rule = {expression}{2,5} ;
Example from the re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax, search for "Repetition qualifiers"
I'm sometimes parsing log files, textified pdf, scanned docs or other things not designed to be parsed. One of the reasons I like TatSu for this is you can be sure you really understood the format within a section and can occasionally explain what you're doing to a non-programmer. In contrast, when I do the same with regular expressions, I sometimes find myself silently skipping bits (and it's very hard to read!). Such formats often have fixed numbers of repetitions, and it's interesting to know whether one's assumption about the number of repetitions always holds.
Also, one sometimes gets cases with a repetitions of one kind, followed by up to b repetitions of another kind, followed by c repetitions of a third kind, which is possibly a harder case to manage.
rule = {int}{4} {int}{2,4} {int}{2} ;
Of course I can just measure the list length in semantics, but I feel this is more properly part of the grammar. So this is low priority.
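For reference, the "measure the list length in semantics" workaround might look roughly like the sketch below. It is plain Python with no TatSu import, and `RepetitionError`, `check_count` and `CountingSemantics` are names made up for illustration. It assumes the semantics object has a method named after the rule and that the closure arrives as a list-like AST.

```python
class RepetitionError(ValueError):
    """Raised when a closure matched an unexpected number of items."""


def check_count(items, minimum, maximum=None):
    """Return `items` unchanged if its length is within [minimum, maximum]."""
    if maximum is None:
        maximum = minimum  # a single number means "exactly that many"
    if not minimum <= len(items) <= maximum:
        raise RepetitionError(
            f"expected {minimum}..{maximum} repetitions, got {len(items)}"
        )
    return items


class CountingSemantics:
    """Hypothetical semantics class; `rule` mirrors the rule name in the grammar."""

    def rule(self, ast):
        # Stands in for the requested `rule = {expression}{2,5} ;`
        return check_count(ast, 2, 5)
```

In a real grammar one would probably raise TatSu's own semantic failure exception instead of a custom one, so that the parser reports the problem as a parse failure. That part is left out here to keep the sketch free of dependencies.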
I think this is a good idea!
The syntax would have to be different, non regex-like, because TatSu already defines {} (and also () and []). There's already a lot of syntax around {}.
Perhaps it could be:
rule = {int}<4> {int}<2,4> {int}<2> ;
I think that TatSu only allows * after {}, so the new syntax could also be:
rule = int*4 int*2-4 (int string)*2 ;
We need to review the current syntax to choose a new one that makes the intention clear and doesn't collide with current semantics.
We should probably first provide an implementation, and decide about the syntax after.
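One way to prototype the behaviour before settling on a syntax, offered only as a sketch and not as how TatSu would actually implement it, is to expand a bounded repetition into the sequence and [] optional constructs that already exist. The helper name `expand_repetition` is hypothetical, and the expression is treated as an opaque string.

```python
def expand_repetition(expr, minimum, maximum=None):
    """Rewrite a bounded repetition using only sequences and [] optionals."""
    if maximum is None:
        maximum = minimum  # a single number means "exactly that many"
    if minimum < 0 or maximum < minimum:
        raise ValueError("need 0 <= minimum <= maximum")

    # The mandatory part is simply the expression written out `minimum` times.
    parts = [expr] * minimum

    # The optional part is nested so that item k+1 can only match after item k.
    optional = ""
    for _ in range(maximum - minimum):
        optional = f"[{expr} {optional}]" if optional else f"[{expr}]"
    if optional:
        parts.append(optional)
    return " ".join(parts)


print(expand_repetition("int", 4))              # int int int int
print(expand_repetition("int", 2, 4))           # int int [int [int]]
print(expand_repetition("(int string)", 2))     # (int string) (int string)
```

The expansion grows linearly with the upper bound, so it is only practical for small counts, but it makes the intended semantics easy to test against.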
I just spent half an hour trying to find out what other syntaxes do and the only one I could find was 're'! To be fair, it's probably the only repetition qualifier most of your users know. And I understand your reason for rejecting it.
It may be necessary to constrain it so that a sequence of repetition qualifiers can only include one range. So:
rule = int*4 int*2:4 int*2:5 int*3
might not be allowed or might be formally determined so the LHS or RHS is greedy.
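To make the ambiguity concrete, here is a small sketch that counts how many ways a run of identical items can be divided among the four groups in the rule above. `possible_splits` is a hypothetical helper, and the brute-force enumeration is for illustration only.

```python
from itertools import product


def possible_splits(total, ranges):
    """Return every way to divide `total` items among consecutive groups.

    `ranges` is a list of (minimum, maximum) pairs, one per group.
    """
    choices = [range(lo, hi + 1) for lo, hi in ranges]
    return [counts for counts in product(*choices) if sum(counts) == total]


# int*4 int*2:4 int*2:5 int*3
groups = [(4, 4), (2, 4), (2, 5), (3, 3)]

print(possible_splits(11, groups))  # [(4, 2, 2, 3)]
print(possible_splits(13, groups))  # [(4, 2, 4, 3), (4, 3, 3, 3), (4, 4, 2, 3)]
```

With 13 ints there are three valid splits, so either only one range is allowed per sequence or a rule such as "leftmost range is greedy" has to pick one of them.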
Did you notice I experimented with a colon in 2:4? I thought it had a more Pythonic flavour, though repetition isn't much like a slice. Of the two you offer, I mostly like the latter but found the '-' sign grated a little because my mind needs it to be subtraction. Too bad the ellipsis isn't on a standard keyboard.