Letter pairs that have Unicode ligatures, like "fi" and "ss", in a lookbehind with the -i flag throw a "Variable length lookbehind not implemented" error
I am using ack 3.5.0.
If I run echo 'BROWNFOX' | ack -i '(?<!fire)fox' from my shell, I get this output:
ack: Invalid regex '(?i)(?<!fire)fox':
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at /usr/local/bin/ack line 602.
But strangely, if I run echo 'BROWNFOX' | ack -i '(?<!ice)fox', I get BROWNFOX as I would expect.
It seems like I only get the error if the lookbehind begins with a lowercase or uppercase f, and has at least one character after it. I do not get the error if I don't use -i.
I think something in Perl is getting confused in the regex parser, and this is not an ack-specific problem. Here are some tests I've tried.
$ perl -E'$x = qr/(?<!ice)fox/'
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?<!fire)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!big)fox/'
$ perl -E'$x = qr/(?i)(?<!fire)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fire)fox/ at -e line 1.
$ perl -E'$x = qr/(?i)(?<!fre)fox/'
$ perl -E'$x = qr/(?i)(?<!dog)fox/'
$ perl -E'$x = qr/(?i)(?<!dig)fox/'
$ perl -E'$x = qr/(?i)(?<!fig)fox/'
Variable length lookbehind not implemented in regex m/(?i)(?<!fig)fox/ at -e line 1.
It looks like the problem is that fi with /i is seen as variable length, as discussed here: https://stackoverflow.com/questions/50356241/variable-length-lookbehind-not-implemented-but-it-isnt-variable-length
Thanks to @wolfsage for pointing me to the StackOverflow answer.
So it looks like the fix is that ack needs to add /aa on the regexes it makes. This will stop it from matching ligatures like it did in the past, but I'm OK with that.
Interestingly, this error comes and goes with the version of Perl.
perlbrew exec perl -e 'print 1 if q(BROWNFOX) =~ /(?<!fire)fox/i'
- works fine for Perl 5.6 through 5.16.3
- fails to compile on 5.17.11 through 5.29.5
- works with a warning on 5.30.0, where variable-length lookbehind is now experimental
Perl 5.30 gives Variable length lookbehind is experimental in regex; marked by <-...
(With -E this fails on Perl 5.6 through 5.8.x, of course. Adding /aa works on 5.16+; I presume it also works on 5.14, where it was added, but I don't have that in my Perlbrew farm. Of course, /aa fails on 5.6 through 5.12.)
So for Perl 5.12 or earlier, we do nothing;
for 5.14+, we insert /aa
(should we determine which 5.13.x it was added in, just to be right?)
Might this /aa fix break the Unicode wide-character workarounds I'd offered folks in the past?
An end-user workaround that @elias6 can use immediately for this edge case is to wrap their RE on the command line with (?aa:...) or prefix it with (?aa)
Compare #222 #153 #262 #258 to see the workarounds offered and the conflicting feature requests ... and the whole "Unicode" tag in Issues https://github.com/beyondgrep/ack3/issues?q=is%3Aissue+utf+label%3Aunicode
Hmm... maybe it does have something to do with ligatures. I get the error when I run ack -i '(?<!ff)', ack -i '(?<!fi)', and ack -i '(?<!fl)', but not ack -i '(?<!fx)'.
This is what I see when I run ack --version:
ack v3.5.0 (standalone version)
Running under Perl v5.18.4 at /usr/bin/perl
@n1vux thanks for offering your workaround. I think my use case is complex enough that it is not worth figuring out how to use it. I have been just doing ack -i '(?<!.ire)fox' and manually picking out the strings I'm looking for.
If text is intended to be matched as ASCII bytes only, then applying the aa modifier universally on Perl 5.14+ may be warranted. For example, the byte 0xA0 read into a Perl string without decoding will be interpreted as the Unicode character U+00A0 NO-BREAK SPACE when matching with Unicode rules, and so \s may match it. But this byte only represents that character if the file happened to be encoded in ISO-8859-1, because that encoding happens to correspond to the Unicode mapping. If the file is not being decoded from bytes into characters, \s should not match Unicode space characters, even those within the range of possible bytes, and the a/aa modifier achieves this.
On the other hand, if there are instances where the file contents get decoded before matching against the regex, and thus Unicode matching is expected to work, the a/aa modifier would disable that ability.
Bill, thanks for pointing out the other Unicode-related tickets. It may be that Can't We Just.... add /aa all over is opening a bigger can of worms.
We stand s(t)olidly on the assumption that ack is for source code and that any natural-language use is "off label". Even so, Perl and other languages permit Unicode (typically UTF-8) in source files, including in identifiers and not just in character strings and comments. So we really do need to support Unicode, at a minimum for adequately scanning Unicode::Tussle's POD and source :smile: (ref to tcgrep website ticket above).
Adding an --(no-)ascii flag to ack (which could be set on or off in .ackrc and reversed on the command line), so that the user can decide whether they want flat ASCII or Unicode, may be useful and even necessary. (This flag would also be the opposite of a --unicode flag that selected UTF-8 vs 16/32 and byte order, if we ever expand to support such messes?)
Should this issue and the 4 that I mentioned upthread ALL be tagged with the unicode label here?