ack3 Unicode ligature pairs like "fi" and "ss" in a lookbehind, plus -i flag, throws a "Variable length lookbehind not implemented" error

I am using ack 3.5.0.

If I run echo 'BROWNFOX' | ack -i '(?<!fire)fox' from my shell, I get this output:

ack: Invalid regex '(?i)(?<!fire)fox':
  Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at /usr/local/bin/ack line 602.

But strangely, if I run echo 'BROWNFOX' | ack -i '(?<!ice)fox', I get BROWNFOX as I would expect.

It seems like I only get the error if the lookbehind begins with a lowercase or uppercase f, and has at least one character after it. I do not get the error if I don't use -i.

Mar 31 '21 03:03 elias6

I think something in Perl is getting confused in the regex parser, and this is not an ack-specific problem. Here are some tests I've tried.

$ perl -E'$x = qr/(?<!ice)fox/'
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!big)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!fre)fox/'
$ perl -E'$x = qr/(?i)(?<!dog)fox/'
$ perl -E'$x = qr/(?i)(?<!dig)fox/'
$ perl -E'$x = qr/(?i)(?<!fig)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fig)fox/ at -e line 1.

Mar 31 '21 14:03 petdance

It looks like the problem is that fi with /i is seen as variable length, as discussed here: https://stackoverflow.com/questions/50356241/variable-length-lookbehind-not-implemented-but-it-isnt-variable-length
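
To illustrate why "fi" counts as variable length under /i: Unicode full case folding maps the single ligature character U+FB01 to the two characters "fi", so a case-insensitive "fi" can be either one or two characters wide. The sketch below uses Python's str.casefold() purely as a stand-in for that folding rule. It is not what Perl's regex engine runs internally, and the helper name fold_length is made up for this example.

```python
# Illustration only: str.casefold() applies Unicode full case folding,
# the same kind of one-to-many fold that makes "fi" variable-length under /i.
import unicodedata


def fold_length(ch: str) -> int:
    """Return how many characters a single character becomes after case folding."""
    return len(ch.casefold())


# Ligatures and sharp s fold to two characters; a plain letter stays at one.
samples = ["\ufb00", "\ufb01", "\ufb02", "\u00df", "x"]

for ch in samples:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: folds to {ch.casefold()!r}, length {fold_length(ch)}")
```

This would also be consistent with the tests above, where "fire" and "fig" fail while "fre" and "dig" compile.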

Thanks to @wolfsage for pointing me to the StackOverflow answer.

Mar 31 '21 14:03 petdance

So it looks like the fix is that ack needs to add /aa on the regexes it makes. This will stop it from matching ligatures like it did in the past, but I'm OK with that.

Mar 31 '21 14:03 petdance

Interestingly, this error comes and goes with the version of Perl. I tested with perlbrew exec perl -e 'print 1 if q(BROWNFOX) =~ /(?<!fire)fox/i'

works fine on Perl 5.6 through 5.16.3
fails to compile on 5.17.11 through 5.29.5
works with a warning on 5.30.0, where variable-length lookbehind is now experimental

Perl 5.30 gives Variable length lookbehind is experimental in regex; marked by <-...

(With -E it fails on Perl 5.6 through 5.8.x, of course. Adding /aa works on 5.16+; I presume it also works on 5.14, where it was added, but I don't have that in my perlbrew farm. Of course /aa fails on 5.6 through 5.12.)

Mar 31 '21 15:03 n1vux

So for Perl 5.12 or earlier, we do nothing; for 5.14+, we insert /aa. (Should we determine which 5.13.x it was introduced in, just to be right?)

Mar 31 '21 15:03 n1vux

This /aa fix may well break the Unicode wide-character workarounds I'd offered folks in the past?

An end-user workaround that @elias6 can use immediately for this edge case is to wrap their regex on the command line in (?aa:...) or prefix it with (?aa).

Mar 31 '21 15:03 n1vux

Compare #222, #153, #262, and #258 to see the workarounds offered and the conflicting feature requests, plus the whole "unicode" label in Issues: https://github.com/beyondgrep/ack3/issues?q=is%3Aissue+utf+label%3Aunicode

Mar 31 '21 16:03 n1vux

Hmm... maybe it does have something to do with ligatures. I get the error when I run ack -i '(?<!ff)', ack -i '(?<!fi)', and ack -i '(?<!fl)', but not ack -i '(?<!fx)'.

This is what I see when I run ack --version:

ack v3.5.0 (standalone version)
Running under Perl v5.18.4 at /usr/bin/perl

Mar 31 '21 16:03 elias6

@n1vux thanks for offering your workaround. I think my use case is complex enough that it is not worth figuring out how to use it. I have been just doing ack -i '(?<!.ire)fox' and manually picking out the strings I'm looking for.

Mar 31 '21 16:03 elias6

If text is intended to be matched as ASCII bytes only, then applying the aa modifier universally on Perl 5.14+ may be warranted. For example, the byte 0xA0 read into a Perl string without decoding will be interpreted as the Unicode character U+00A0 NO-BREAK SPACE when matching with Unicode rules, and so \s may match it. But this byte only represents that character if the file happens to be encoded in ISO-8859-1, because that encoding happens to correspond to the Unicode mapping. If the file is not being decoded from bytes into characters, \s should not match Unicode space characters, even those within the range of possible bytes, and the a/aa modifier achieves this.

On the other hand, if there are instances where the file contents get decoded before matching against the regex, and thus Unicode matching is expected to work, the a/aa modifier would disable that ability.
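
A rough analogy for the trade-off described above, sketched in Python rather than Perl: Python's re.ASCII flag is not identical to Perl's /a or /aa, but it restricts \s to ASCII whitespace in a similar way. The sample bytes and the latin-1 decode are assumptions chosen to mimic "undecoded bytes treated as code points of the same number".

```python
import re

# Hypothetical file contents: the byte 0xA0 sits between two words.
raw = b"foo\xa0bar"

# Mapping each byte to the code point of the same number (what latin-1 decoding does)
# turns 0xA0 into U+00A0 NO-BREAK SPACE, whatever the file's real encoding was.
text = raw.decode("latin-1")

unicode_ws = re.compile(r"\s")           # Unicode rules: U+00A0 counts as whitespace
ascii_ws = re.compile(r"\s", re.ASCII)   # ASCII rules: only space, \t, \n, \r, \f, \v

print("Unicode rules match:", unicode_ws.search(text) is not None)
print("ASCII rules match:  ", ascii_ws.search(text) is not None)
```

Under ASCII rules the stray byte is left alone, which is the safer reading for undecoded input, but the same restriction would also hide genuine Unicode whitespace in text that had been decoded properly.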

Mar 31 '21 16:03 Grinnz

Bill, thanks for pointing out the other Unicode-related tickets. It may be that "Can't we just add /aa all over?" is opening a bigger can of worms.

Mar 31 '21 16:03 petdance

While we stand s(t)olidly on the assumption that ack is for source code, and that any natural-language use is "off-label" use, Perl and other languages permit Unicode (typically UTF-8) in source files, including in identifiers and not just in character strings and comments. So we really do need to support Unicode, at a minimum for adequately scanning Unicode::Tussle's POD and source :smile: (ref. the tcgrep website ticket above).

Adding an --(no-)ascii flag to ack (which can be set on or off in .ackrc and reversed on the command line), so that the user can decide whether they want flat ASCII or Unicode, may be useful and even necessary. (This flag would also be the opposite of a --unicode flag that selected UTF-8 vs. 16/32 and byte order, if we ever expand to support such messes?)

This issue and the four that I mentioned upthread should ALL be tagged with the unicode label here?

Apr 12 '21 02:04 n1vux