kaleidoscope
Support repeated capture groups
Despite what I originally came to believe, accessing every element of a repeated capture group is theoretically possible, at least if some reasonable constraints are imposed on where they can occur: specifically, that bound capture groups (of any repetition) should be forbidden inside non-unitary capture groups.
With this constraint in place (which Kaleidoscope can enforce at compile time), it should be possible to construct lists of matches by repeatedly attempting to match the same regular expression against successively shorter input strings, each time deleting the characters captured by the previous match and checking that the captured regions are contiguous. The matches will be found in reverse order, and if more than one repeated group appears, the groups should also be extracted in reverse order.
This should be implemented first in a non-pattern-matching context so that it can be tested easily.
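As a rough illustration of the delete-and-rematch idea, here is a minimal sketch written as a plain function outside any pattern-matching context. It uses Python's `re` module as a stand-in for the regex engine; the function name `repeated_group_matches` and its parameters are hypothetical, and this is one reading of the approach described above, not Kaleidoscope's implementation.

```python
import re

def repeated_group_matches(pattern, text, group=1):
    """Collect every iteration of a repeated capture group (hypothetical helper).

    A repeated group only reports its last iteration, so we match, record
    that iteration, delete its characters from the input, and match again.
    """
    regex = re.compile(pattern)
    found = []
    expected_end = None  # where the next (earlier) iteration must end
    while True:
        m = regex.fullmatch(text)
        if m is None or m.group(group) is None:
            break  # no match, or the group did not participate
        start, end = m.span(group)
        if start == end:
            break  # an empty capture would never shrink the input
        if expected_end is not None and end != expected_end:
            break  # the captured regions are not contiguous
        found.append(m.group(group))
        text = text[:start] + text[end:]  # delete the captured characters
        expected_end = start
    return found[::-1]  # iterations were found last-to-first

print(repeated_group_matches(r"f(o.)*bar", "foaobocbar"))
```

The sketch simply stops when a region is not contiguous; whether that case should instead be reported as an error is left open here.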
Capture groups with "range" multiplicity need a bit more effort: in order to match them reliably, they need to be transformed into some number of fixed capture groups plus a repeated capture group by rewriting the regular expression. This will then require careful counting of the capture groups that have been repeated, since their numbering will have changed.
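The rewriting step for "range" multiplicity might look something like the sketch below. It assumes the inner pattern `X` contains no capturing groups of its own, and it takes the bounds as arguments instead of parsing them out of the regular expression. The helper name `expand_range` is hypothetical, and the choice of `{0,k}` for the bounded tail is an assumption.

```python
import re

def expand_range(inner, lo, hi=None):
    """Rewrite (inner){lo,hi} as `lo` fixed groups plus one repeated group.

    Returns the rewritten fragment and the number of groups inserted ahead
    of the repeated one, which is how far later group numbers shift.
    Assumes `inner` has no capturing groups of its own.
    """
    fixed = f"({inner})" * lo
    if hi is None:
        tail = f"({inner})*"  # unbounded upper limit
    else:
        tail = f"({inner}){{0,{hi - lo}}}"  # bounded remainder
    return fixed + tail, lo

fragment, shift = expand_range("o.", 2)  # stands for (o.){2,}
m = re.fullmatch("f" + fragment + "(bar)", "foaobocbar")
# The trailing (bar) group was number 2 before rewriting; it is now 2 + shift.
print(fragment, shift, m.group(2 + shift))
```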
I think this problem is easier than I previously thought. Firstly, we should give each capturing group we wish to capture a name (e.g. `r1`, `r2`, `rn`, etc.) of the form `(?<rn>X)`. But we should also rewrite repeated capturing groups (those followed by `*`, `+`, `{n}`, etc.) to be inside an additional set of parentheses, i.e. `f(o.)*bar` would be rewritten as `f(?<rn>(o.)*)bar`. We should then construct and compile the pattern from inside the original parentheses, i.e. `o.`. We can attempt to match this repeatedly on the match extracted from the outer parentheses to get a list of strings.
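A minimal sketch of this two-stage approach follows, again using Python's `re` module as a stand-in. Python spells named groups `(?P<name>X)` where the text above uses `(?<name>X)`. The function name `extract_repeats` and its parameters are hypothetical, and the rewritten outer pattern and the inner pattern are passed in by hand instead of being derived automatically.

```python
import re

def extract_repeats(outer_pattern, inner_pattern, name, text):
    """Match the rewritten outer pattern, then split the named region by
    matching the inner pattern repeatedly (hypothetical helper)."""
    outer = re.fullmatch(outer_pattern, text)
    if outer is None:
        return None  # the overall pattern did not match
    region = outer.group(name)  # everything the repeated group consumed
    inner = re.compile(inner_pattern)
    items, pos = [], 0
    while pos < len(region):
        m = inner.match(region, pos)  # anchored at the current position
        if m is None or m.end() == pos:
            break  # no progress, so stop instead of looping forever
        items.append(m.group(0))
        pos = m.end()
    return items

# f(o.)*bar rewritten with a named outer group; the inner pattern is o.
print(extract_repeats(r"f(?P<r1>(o.)*)bar", r"o.", "r1", "foaobocbar"))
```

One caveat with this sketch: if the inner pattern is ambiguous, matching it left to right over the extracted region may divide the region differently from how the original expression did.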