Possible regression: named capture groups in --output pattern

Open hftf opened this issue 5 years ago • 9 comments

I looked up how to use named capture groups in the --output pattern and found a Stack Overflow answer titled “How to use named regex groups in ack output?”.

  1. The given solution works for me in ack 2.22, but not in 3.2.0. The feature seems to have been removed unintentionally, unless it was replaced by a different syntax. (Should ack’s ideal syntax for referring to named capture groups in --output even be based on Perl’s $+ syntax?)
  2. The $+{name} syntax described in the answer didn’t seem to be documented explicitly under --output either in the ack 3.2.0 manual or in old versions’ manuals.

Related: #91 #93 #9 #246

$ ack --version
ack 2.22
Running under Perl 5.26.1 at /usr/bin/perl
$ echo '123 zz' | ack '(?P<a>\d+) (?P<b>z+)' --output='$+{b} $+{a}'
zz 123
$ ack --version
ack v3.2.0 (standalone version)
Running under Perl v5.18.2 at /usr/bin/perl
$ echo '123 zz' | ack '(?P<a>\d+) (?P<b>z+)' --output='$+{b} $+{a}'
zz{b} zz{a}

hftf • Nov 13 '19 23:11

Yes, the security/safety removal of eval in #91 will have removed support for %+ named captures. It removed everything not listed on the ticket as being implemented. Your example test case shows #91 is correctly implemented -- that which is not required is unsupported in --output. Linking #9 is correct, that is where new intentional functionality would go.

Much as I like named captures in recent Perl RE myself, I've not come up with a compelling case where they'd greatly simplify an Ack match and --output.

We won't be returning to eval.

To add %+ processing back in, we'd need to be convinced that the work of disambiguating $+{c} from $+somethingElse was worthwhile, and we would likely still balk at handling the whole hash %+ and slices such as @+{@list}.
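To make the disambiguation concern concrete, here is a minimal sketch in Python (not ack's Perl code, and not a proposed patch) of an eval-free expander. It assumes a hypothetical token grammar in which $+{name} is tried before bare $+; the names expand, last_participating, WITH_NAMED, and WITHOUT_NAMED are illustrative only. Whole-hash %+ and slices are deliberately not handled.

```python
import re

# Hypothetical token grammar for an --output-style template.
NAMED = r"\$\+\{(?P<name>\w+)\}"                 # $+{name}
BASIC = r"(?P<last>\$\+)|\$(?P<num>[1-9])"       # $+ or $1..$9
WITH_NAMED = re.compile(NAMED + "|" + BASIC)     # named form is tried first
WITHOUT_NAMED = re.compile(BASIC)                # no named-capture support


def last_participating(match):
    """Text of the highest-numbered group that took part in the match."""
    for index in range(match.re.groups, 0, -1):
        if match.group(index) is not None:
            return match.group(index)
    return ""


def expand(template, match, tokens=WITH_NAMED):
    """Substitute capture references in template without any eval."""

    def substitute(tok):
        parts = tok.groupdict()
        if parts.get("name") is not None:
            # Unknown or unmatched names expand to the empty string.
            return match.groupdict().get(parts["name"]) or ""
        if parts.get("last") is not None:
            return last_participating(match)
        index = int(parts["num"])
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    return tokens.sub(substitute, template)


m = re.search(r"(?P<a>\d+) (?P<b>z+)", "123 zz")
print(expand("$+{b} $+{a}", m))                 # zz 123
print(expand("$+{b} $+{a}", m, WITHOUT_NAMED))  # zz{b} zz{a}
```

With the named alternative removed, the sketch expands bare $+ and leaves {b} as literal text, which is consistent with the ack 3.2.0 output in the original report. Because alternation tries $+{name} first, $+somethingElse still falls through to the bare $+ rule.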

More plausibly, this would be addressed under #9, with a new syntax rather than an emulation of Perl qq() interpolation.

Is there a match and output example that is compelling for using named rather than numbered matches?

n1vux • Nov 13 '19 23:11

Thanks for the quick response.

I don’t think there’s a need to disambiguate $+{c} and $+somethingElse. An arbitrary new syntax that’s safe and unambiguous (like $f was chosen for $filename) can be devised instead (cf. my parenthetical). Even though $+{c} would be based on Perl, not overloading $+ to do two functions in --output seems best – one idea might be $P<c> to reflect the regex syntax.


My use case:

Right now, I use ack to extract tags scattered in plain text documents. (Rather than build rigorous programs to do simple functionality, I bootstrap more quickly with cli utilities.) Specifically, I expect tags in the form <Author, Category>. I use e.g. ack --output=$'$1\t$2' '<(.*?), (.*)?>' to quickly generate a tsv.

However, some disobedient authors consistently use non-standard tags (e.g. swapped order as in <Category, Author> or missing comma as in <Author Category>). Notice how the search pattern above still matches in the first case, but gives an incorrect result.

I want the (previously hardcoded) search pattern to be user-overridable, to cover any non-standard format, while fixing the meaning of the output pattern for generating a tsv. I suppose I could make the user specify both the search pattern and the output pattern, but it seems better if I could write something like ack --output=$'$P<author>\t$P<category>' $tag_pattern, where tag_pattern is <(?P<author>.*?), (?P<category>.*)?> by default, but customizable in a settings file per user.
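As a rough illustration of this use case, the sketch below (Python, not ack) keeps the output template fixed while the tag pattern varies. The $P<name> reference syntax is only the idea floated above, not anything ack supports; render, extract_tsv, SWAPPED_TAG_PATTERN, and the sample tags are made up for the example.

```python
import re

# Default pattern from the comment above, plus a hypothetical user override
# for authors who write <Category, Author>.
DEFAULT_TAG_PATTERN = r"<(?P<author>.*?), (?P<category>.*)?>"
SWAPPED_TAG_PATTERN = r"<(?P<category>.*?), (?P<author>.*)?>"

# Fixed output template; $P<name> is the proposed (hypothetical) syntax.
OUTPUT_TEMPLATE = "$P<author>\t$P<category>"
REFERENCE = re.compile(r"\$P<(\w+)>")


def render(template, match):
    """Replace each $P<name> with the text of the named group, or ''."""
    groups = match.groupdict()
    return REFERENCE.sub(lambda ref: groups.get(ref.group(1)) or "", template)


def extract_tsv(text, tag_pattern=DEFAULT_TAG_PATTERN):
    """Return one TSV row per tag found in text."""
    return [render(OUTPUT_TEMPLATE, m) for m in re.finditer(tag_pattern, text)]


print(extract_tsv("intro <Knuth, Algorithms> outro"))
print(extract_tsv("intro <Algorithms, Knuth> outro", SWAPPED_TAG_PATTERN))
```

Both calls produce the same author-then-category row, because the output template refers to groups by name rather than by position.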

hftf • Nov 14 '19 01:11

I'd be glad to have a version of --output that could handle those named captures, if we can do it safely, without eval, and consistently. When we de-evaled the code, we didn't look at the named captures. I don't know what it would take to do the $+{xxx} evaluation.

petdance • Nov 14 '19 14:11

Do you think this issue counts as a bug (as it is essentially a regression) or a feature?

One more thought: if a syntax like $+{c} is added for named capture groups, that raises a question about supporting numbered capture groups beyond $9, such as $10 or ${10} as in other apps, or $0.
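The ambiguity is easy to see in a small Python sketch (again, not ack's code). It assumes a hypothetical rule where an unbraced reference takes exactly one digit, a braced ${N} takes any number of digits, and $0 means the whole match; none of these choices is confirmed for ack.

```python
import re

# Hypothetical reference forms: ${N} with braces, or a single-digit $N.
REFERENCE = re.compile(r"\$\{(?P<braced>\d+)\}|\$(?P<digit>\d)")


def expand(template, match):
    """Expand numbered references; out-of-range groups become ''."""

    def substitute(ref):
        index = int(ref.group("braced") or ref.group("digit"))
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    return REFERENCE.sub(substitute, template)


m = re.search("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)", "abcdefghij")
print(expand("$10", m))    # a0: group 1 followed by a literal 0
print(expand("${10}", m))  # j: group 10
print(expand("$0", m))     # abcdefghij: whole match, by assumption
```

Under the single-digit rule, $10 can never mean group 10, so a braced form (or some other delimiter) is needed to reach it.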

hftf • Nov 14 '19 16:11

Pedantically, it would only be "essentially a regression" if it were ever intentionally added as a feature or fixed as a bug before. Losing an undocumented, unintended "feature" doesn't quite pass the bar for the classical definition.

Pedantry aside, since it was never an intentional feature (previously, anything in Perl RE that wasn't specifically forbidden was rather insecurely allowed, whereas now only what is specifically allowed works), I'd say this is a FEATURE REQUEST. But that's just me; Andy has N votes to my one ☺.

IF this doesn't slow down processing that doesn't use it, it would be good to add, but any implementation PR would be subject to performance testing!

n1vux • Dec 23 '19 21:12

Found myself wanting this feature again today and stumbled again upon the Stack Overflow answer linked in the OP. I know use cases are typically asked for, so: I had a regular expression with a lot of nested alternations and lookaheads, which would be a good bit more unwieldy with (?:) in place of every (). I can't use character classes instead of alternations since I'm dealing with higher-range Unicode characters. I ended up counting groups, using numbered backreferences and $+, and changing the () to (?:), but it would have been easier to name the groups I wanted to capture and refer to them in the --output flag.

hftf • Mar 14 '23 03:03

I support this feature request. (IDK how hard it would be to add, and alas I don't currently have the TUITs (=spoons) to find out.)

If performance testing shows that detecting named capture %+ references in --output has an adverse impact, a .ackrc and command-line option to enable it could give users the choice.

(More generally, in our copious spare time we might want to dig through perldoc perlre to see what other (not so very) ^recent^ PerlRE extensions besides Named Capture are harder to use or unusable in ack. Being able to exploit the full power [except for arbitrary code execution!] of whatever PerlRE is running ack should be a goal? If so, this request is obvious. E.g., I'd like (?x) as a command-line option, similar to how m{...}x lifts it out of the RE. But at least it's usable at the start of the ack RE. [Pedantically: if that is presumed to have been unstated intent in Ack2, then yes, this was a regression; we missed that the %+ PerlRE feature might have been in use with Ack2 when we listed what was legal and safe for Ack3 security.])

n1vux • Mar 14 '23 18:03