ack3
\p{Han} won't work
ack '\p{Han}' 1.txt
1.txt content:
hello 世界
hello word
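#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 in the source file
binmode(STDOUT, ':encoding(UTF-8)');    # avoids "Wide character in print"

my $s1 = "世界";
print "$s1 match Han\n" if $s1 =~ /\p{Han}/;
my $s2 = "word";
print "$s2 can't match Han\n" if $s2 !~ /\p{Han}/;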
I'm not sure what the code example is supposed to be showing.
Ok, I have a workaround for h2ero.
A variation of the UTF workaround (from the UTF-16 thread on Ack-Users and #153) works for this, but with a new warning (a deprecation warning on Perl 5.24-5.28, fatal in 5.30):
$ perl -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}' han.txt
sysread() is deprecated on :utf8 handles. This will be a fatal error in Perl 5.30 at /home/wdr/bin/ack line 302.
hello 世界
(Note: same results with all three spellings: UTF-8, utf-8, utf8.)
So my hack workaround for Ack+UTF will fail under the upcoming Perl 5.30. Applying an encoding layer to sysread mildly undoes the speed advantage of sysread, so the deprecation is fair. This means the buffer read by sysread should be UTF-decoded after it has been read, according to the locale (or a per-command or per-file override), not by the use open IO => ":encoding(UTF-8)" hack.
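As a rough illustration of "read raw, decode afterwards", here is a minimal Perl sketch. It is not ack's code: slurp_decoded is a hypothetical helper, the 64 KB chunk size is arbitrary, and the encoding is passed in by hand where a real implementation would presumably take it from the locale or an override. The sample text is the two-line file from the top of this thread, written with \x{...} escapes for 世界.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper (not part of ack): read the whole file as raw bytes
# with sysread, then decode the buffer in a separate step.
sub slurp_decoded {
    my ($path, $encoding) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = '';
    # The handle is :raw, so sysread never touches a :utf8 layer.
    while (1) {
        my $n = sysread($fh, $bytes, 65536, length $bytes);
        die "Read error on $path: $!" unless defined $n;
        last if $n == 0;
    }
    close $fh;
    return decode($encoding, $bytes);
}

# Recreate the sample file from this thread as UTF-8 bytes.
# \x{4E16}\x{754C} is the two-character string 世界.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} encode('UTF-8', "hello \x{4E16}\x{754C}\nhello word\n");
close $tmp_fh;

# Match against decoded characters, one line at a time.
my @hits = grep { /\p{Han}/ } split /^/, slurp_decoded($tmp_path, 'UTF-8');

binmode(STDOUT, ':encoding(UTF-8)');
print @hits;
```

Run as-is, this should print only the line containing the Han characters. Decoding the whole buffer costs time on every file, which is presumably the trade-off any real change to ack would have to weigh against the raw sysread path.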
We knew Ack working with UTF-8 Latin-1 accents was pleasantly serendipitous; this tells us that serendipity does not extend to UTF-8 multibyte chars (the ones that trigger "Wide character in print" if handled naively).
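To make the bytes-versus-characters distinction concrete, here is a small standalone Perl sketch. It only illustrates Perl's own behaviour and is my assumption about why the undecoded match fails, not a trace of what ack does internally. The same text is matched once as raw UTF-8 octets and once after decoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 世界 (U+4E16 U+754C) as the UTF-8 octets found on disk, and decoded again.
my $bytes = encode('UTF-8', "\x{4E16}\x{754C}");
my $chars = decode('UTF-8', $bytes);

# On the byte string each octet is treated as a code point below 256,
# and none of those carries the Han script property.
my $bytes_match = $bytes =~ /\p{Han}/ ? 1 : 0;
my $chars_match = $chars =~ /\p{Han}/ ? 1 : 0;

printf "length as bytes: %d, matches Han: %d\n", length($bytes), $bytes_match;
printf "length as chars: %d, matches Han: %d\n", length($chars), $chars_match;
```

This prints a length of 6 with no match for the byte string, and a length of 2 with a match for the decoded string.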
I believe it is possible to modify Ack3 to properly handle UTF-8, UTF-16LE/BE/BOM/..., and UTF-32/BOM, but the development effort needed to multiply the test suite to cover 4-8 variations of each is a major impediment. (Maybe we should submit a UTF-* test suite and UTF-switching internal flow as a GSoC opportunity for summer 2020? :smile: )
--passthru will avoid the fatal error in 5.30+ by skipping the sysread pre-check optimization, but at a cost: all lines print, and highlights may get Unicode-mangled.