ack3
Possible regression: named capture groups in --output pattern
I looked up how to use named capture groups in the --output pattern and found a Stack Overflow answer titled “How to use named regex groups in ack output?”.
- The given solution works for me in ack 2.22, but not 3.2.0. The feature seems to have been removed unintentionally, unless the syntax was changed to something else. (Should ack’s ideal syntax for referring to named capture groups in --output even be based on Perl’s $+ syntax?)
- The $+{name} syntax described in the answer didn’t seem to be documented explicitly under --output, either in the ack 3.2.0 manual or in old versions’ manuals.
Related: #91 #93 #9 #246
$ ack --version
ack 2.22
Running under Perl 5.26.1 at /usr/bin/perl
$ echo '123 zz' | ack '(?P<a>\d+) (?P<b>z+)' --output='$+{b} $+{a}'
zz 123
$ ack --version
ack v3.2.0 (standalone version)
Running under Perl v5.18.2 at /usr/bin/perl
$ echo '123 zz' | ack '(?P<a>\d+) (?P<b>z+)' --output='$+{b} $+{a}'
zz{b} zz{a}
Yes, the security/safety removal of eval in #91 will have removed support for %+ named captures.
It removed everything not listed on the ticket as being implemented.
Your example test case shows #91 is correctly implemented -- that which is not required is unsupported in --output.
Linking #9 is correct, that is where new intentional functionality would go.
Much as I like named captures in recent Perl RE myself, I've not come up with a compelling case where they'd greatly simplify an Ack match and --output.
We won't be returning to eval.
To add %+ processing back in, we'd need to be convinced that the work of disambiguating $+{c} from $+somethingElse was worthwhile, and we'd likely still balk at handling the whole hash %+ and the slice @+{@list}.
More plausibly, this would be addressed under #9, with a new syntax other than emulating Perl qq() interpolation.
Is there a match and output example that is compelling for using named rather than numbered matches?
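To make the disambiguation question concrete, here is a minimal sketch of eval-free template expansion. It is not ack's code: Python's re module stands in for Perl's regex engine, the function and constant names are made up for illustration, and Python's lastindex only approximates Perl's $+. Trying the longer $+{name} form before the bare $+ form is what separates the two; with only the bare form recognized, the sketch reproduces the ack 3.2.0 output from the report.

```python
import re

# Assumption: a Perl-like template where "$+{name}" is a named capture and a
# bare "$+" is the last matched group. All names here are hypothetical.
NAMED_OR_LAST = re.compile(r"\$\+\{(\w+)\}|\$\+")  # longer form is tried first
LAST_ONLY = re.compile(r"\$\+")                    # only the bare "$+" form


def expand(template, match, token=NAMED_OR_LAST):
    """Expand the template by substitution only; nothing is eval'ed."""
    def repl(tok):
        if tok.lastindex:                      # "$+{name}" form was matched
            return match.group(tok.group(1)) or ""
        last = match.lastindex                 # bare "$+" form
        return match.group(last) if last else ""
    return token.sub(repl, template)


m = re.search(r"(?P<a>\d+) (?P<b>z+)", "123 zz")
print(expand("$+{b} $+{a}", m))             # zz 123
print(expand("$+{b} $+{a}", m, LAST_ONLY))  # zz{b} zz{a}
```

An unknown group name raises IndexError in this sketch; a real implementation would have to decide whether that is an error or an empty string.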
Thanks for the quick response.
I don’t think there’s a need to disambiguate $+{c} and $+somethingElse. An arbitrary new syntax that’s safe and unambiguous (like $f was chosen for $filename) can be devised instead (cf. my parenthetical). Even though $+{c} would be based on Perl, not overloading $+ to do two functions in --output seems best – one idea might be $P<c> to reflect the regex syntax.
My use case:
Right now, I use ack to extract tags scattered in plain text documents. (Rather than build rigorous programs to do simple functionality, I bootstrap more quickly with cli utilities.) Specifically, I expect tags in the form <Author, Category>. I use e.g. ack --output=$'$1\t$2' '<(.*?), (.*)?>' to quickly generate a tsv.
However, some disobedient authors consistently use non-standard tags (e.g. swapped order as in <Category, Author> or missing comma as in <Author Category>). Notice how the search pattern above still matches in the first case, but gives an incorrect result.
I want the (previously hardcoded) search pattern to be user-overridable, to cover any non-standard format, while fixing the meaning of the output pattern for generating a tsv. I suppose I could make the user specify both the search pattern and the output pattern, but it seems better if I could write something like ack --output=$'$P<author>\t$P<category>' $tag_pattern, where tag_pattern is <(?P<author>.*?), (?P<category>.*)?> by default, but customizable in a settings file per user.
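The use case above can be sketched outside ack to show why a fixed output pattern plus an overridable search pattern is attractive. This is only an illustration under stated assumptions: the $P<name> output syntax is the proposal from this comment and not something ack supports, Python's re module stands in for Perl's, and tags_to_tsv and the constant names are hypothetical.

```python
import re

# Default search pattern from the comment above; a user could override it.
DEFAULT_TAG_PATTERN = r"<(?P<author>.*?), (?P<category>.*)?>"
# Hypothetical "$P<name>" reference in the output pattern.
NAMED_REF = re.compile(r"\$P<(\w+)>")


def tags_to_tsv(text, tag_pattern=DEFAULT_TAG_PATTERN,
                output="$P<author>\t$P<category>"):
    """Return one tab-separated row per tag found in the text."""
    rows = []
    for m in re.finditer(tag_pattern, text):
        # Replace each $P<name> with the text of the group of that name.
        rows.append(NAMED_REF.sub(lambda t: m.group(t.group(1)) or "", output))
    return rows


# Standard tag order with the default pattern.
print(tags_to_tsv("<Alice, Math>"))
# An author who swaps the order only needs a different search pattern;
# the output pattern, and so the column order of the tsv, stays fixed.
print(tags_to_tsv("<Math, Alice>", r"<(?P<category>.*?), (?P<author>.*)?>"))
```

Both calls yield the same row (author, then category), which numbered references such as $1 and $2 cannot guarantee once the pattern is user-supplied.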
I'd be glad to have a version of --output that could handle those named captures, if we can do it safely, without eval, and consistently. When we de-evaled the code, we didn't look at the named captures. I don't know what it would take to do the $+{xxx} evaluation.
Do you think this issue counts as a bug (as it is essentially a regression) or a feature?
One more thought: if a syntax like $+{c} is added for named capture groups, that raises a question about supporting numbered capture groups beyond $9, such as $10 or ${10} as in other apps, or $0.
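One possible rule for multi-digit references, sketched here purely as an assumption: a braced ${N} accepts any number of digits, while an unbraced $N is always a single digit, so $10 means group 1 followed by a literal 0. Whether ack should adopt this rule, or treat $0 as the whole match, is exactly the open question; the names below are hypothetical and Python's re module stands in for Perl's.

```python
import re

# "${N}" with any number of digits, or "$N" with exactly one digit.
NUM_REF = re.compile(r"\$\{(\d+)\}|\$(\d)")


def expand_numbered(template, match):
    """Expand numbered references by substitution only; nothing is eval'ed."""
    def repl(tok):
        n = int(tok.group(1) or tok.group(2))
        if n > match.re.groups:
            return ""                    # reference to a group that does not exist
        return match.group(n) or ""      # group 0 is the whole match
    return NUM_REF.sub(repl, template)


m = re.search(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)", "abcdefghij")
print(expand_numbered("${10}-$10-$0", m))   # j-a0-abcdefghij
```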
Pedantically, it would only be "essentially a regression" if it were ever intentionally added as a feature or fixed as a bug before. Losing an undocumented, unintended "feature" doesn't quite pass the bar for the classical definition.
Pedantry aside, since it was never an intentional feature (previously, anything in Perl RE that wasn't specifically forbidden was rather insecurely allowed, whereas now only what is specifically allowed works), I'd say this is a FEATURE REQUEST. But that's just me; Andy has N votes to my one ☺.
IF this doesn't slow down processing that doesn't use it, it would be good to add, but any implementation PR would be subject to performance testing!
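One way to keep the cost away from runs that don't use named references is to inspect the --output template once, before any file is searched, and pick a renderer then. The following is a minimal sketch of that idea, not a measurement and not ack's implementation; compile_template is a hypothetical name, the $+{name} syntax is the one from the original report, and Python's re module stands in for Perl's.

```python
import re

NAMED_REF = re.compile(r"\$\+\{(\w+)\}")


def compile_template(template):
    """Parse the template once; return a function that renders one match."""
    # With one capturing group, split() alternates literals and group names.
    parts = NAMED_REF.split(template)
    literals, names = parts[0::2], parts[1::2]
    if not names:
        # No named references: no extra work is done per match.
        return lambda match: template

    def render(match):
        out = [literals[0]]
        for name, literal in zip(names, literals[1:]):
            out.append(match.group(name) or "")
            out.append(literal)
        return "".join(out)
    return render


m = re.search(r"(?P<a>\d+) (?P<b>z+)", "123 zz")
render = compile_template("$+{b} $+{a}")
print(render(m))                        # zz 123
print(compile_template("plain")(m))     # plain
```

Whether this is fast enough in practice would still need the performance testing mentioned above.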
Found myself wanting this feature again today and stumbled again upon the StackOverflow answer linked in the OP. I know use cases are typically asked for, so: I had a regular expression with a lot of nested alternations and lookaheads, which would be a good bit more unwieldy with (?:) in place of every (). I can't use character classes instead of alternations since I'm dealing with higher-range Unicode characters. I ended up counting using numbered backreferences and $+ and changing the () to (?:), but it would have been easier to name the groups I wanted to capture in the --output flag.
I support this feature request. (IDK how hard it would be to add, and alas I don't currently have the TUITs (=spoons) to find out.)
If performance testing shows that detecting named capture %+ references in --output has an adverse impact, having a .ackrc and command-line option to enable such would be a way to have choices?
(More generally, in our copious spare time we might want to dig through perldoc perlre to see what other (not so very) ^recent^ PerlRE extensions besides Named Capture are harder to use or unusable in ack.
Being able to exploit the full power [except for arbitrary code execution!] of whatever PerlRE is running ack should be a goal? If so, this request is obvious.
E.g., I'd like (?x) as a command-line option, similar to how m{...}x lifts it out of the RE. But at least it's usable at the start of the ack RE.
[pedantically: If that is presumed to have been unstated intent in Ack2, then yes this was a regression; we missed that the %+ PerlRE feature might have been in use with Ack2 when we listed what was legal and safe for Ack3 security.])