ack3

\p{Han} won't work

Open h2ero opened this issue 5 years ago • 4 comments

ack '\p{Han}' 1.txt

1.txt content:

hello 世界
hello word

h2ero, Aug 08 '19 06:08

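#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # string literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');  # avoid "Wide character in print"

# With "use utf8" the literal is a decoded character string, so \p{Han} matches.
my $s1 = "世界";
print "$s1 match Han\n" if $s1 =~ /\p{Han}/;

my $s2 = "word";
print "$s2 can't match Han\n" if $s2 !~ /\p{Han}/;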

h2ero, Aug 08 '19 06:08

I'm not sure what the code example is supposed to be showing.

petdance, Aug 08 '19 15:08

Ok, I have a workaround for h2ero.

A variation of the UTF workaround (from the UTF-16 thread on Ack-Users and #153) works for this, but with a new warning (on Perl 5.24-5.28; fatal in 5.30):

$ perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}' han.txt
sysread() is deprecated on :utf8 handles. This will be a fatal error in Perl 5.30 at /home/wdr/bin/ack line 302.
hello 世界

(Note: same results with all three spellings: UTF-8, utf-8, utf8.)

So my hack workaround for Ack+UTF will fail with the future Perl 5.30. Applying an encoding layer to sysread mildly undoes the speed advantage of sysread, so it is deprecated; fair. This means the buffer read by sysread should be UTF-decoded per locale (or per a per-command or per-file override) after being read, not by the use open IO=>":encoding(UTF-8)" hack.
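The decode-after-read idea might look roughly like the sketch below. This is not ack's code: read_decoded and han_lines are hypothetical helper names, the 64 KiB chunk size is arbitrary, and UTF-8 is hard-coded where a real fix would take the encoding from the locale or an override. The handle stays raw, so sysread never touches an encoding layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the sample literal below is UTF-8 source text
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper (not part of ack): read raw bytes with sysread,
# then decode them as UTF-8 only after the read is complete.
sub read_decoded {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ($bytes, $chunk) = ('', '');
    while (1) {
        my $n = sysread($fh, $chunk, 65536);
        die "sysread failed on $path: $!" unless defined $n;
        last if $n == 0;               # end of file
        $bytes .= $chunk;              # decode later, so no multibyte
                                       # character is split at a chunk edge
    }
    close $fh;
    return decode('UTF-8', $bytes);
}

# Hypothetical helper: return the lines that contain a Han character.
sub han_lines {
    my ($path) = @_;
    return grep { /\p{Han}/ } split /^/, read_decoded($path);
}

# Demo on a temporary copy of the file from the report.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} encode('UTF-8', "hello 世界\nhello word\n");
close $tmp_fh;

binmode STDOUT, ':encoding(UTF-8)';
print han_lines($tmp_path);
```

Collecting all the bytes before decoding keeps the sketch simple; a streaming version would have to carry an incomplete trailing sequence over to the next chunk.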

We knew Ack working with UTF-8 Latin-1 accents was pleasantly serendipitous; this tells us that the serendipity does not extend to UTF-8 characters beyond Latin-1 (the ones that give "Wide character in print" if handled naively).
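A small sketch of why this happens, assuming the text reaches the regex as undecoded bytes (which is what the encoding-layer workaround suggests, though I have not traced it through ack's source). The helper name matches_han is made up for illustration. Each undecoded byte is treated as its own Latin-1 code point, and none of those is a Han character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if the string contains a Han character.
sub matches_han {
    my ($s) = @_;
    return $s =~ /\p{Han}/ ? 1 : 0;
}

my $text  = "\x{4E16}\x{754C}";        # the two Han characters from 1.txt
my $bytes = encode('UTF-8', $text);    # six raw octets, as read from disk
my $chars = decode('UTF-8', $bytes);   # two decoded characters

printf "undecoded bytes: %s\n", matches_han($bytes) ? 'match' : 'no match';
printf "decoded chars:   %s\n", matches_han($chars) ? 'match' : 'no match';
```

The first line reports no match and the second reports a match, which is the behaviour seen in the original report.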

I believe it is possible to modify Ack3 to properly handle UTF-8, UTF-16LE/BE/BOM/..., UTF-32/BOM, but the development effort to multiply the test suite to handle 4-8 variations of each is a major impediment. (Maybe we should submit a UTF-* test suite and UTF-switching internal flow as a GSoC opportunity for summer 2020? :smile: )
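To give a feel for the test matrix, here is a rough sketch that pushes one sample line through several encodings known to the core Encode module and checks that \p{Han} matches once the octets are decoded again. The list of encodings and the helper name roundtrip_matches are my own choices for illustration, not an existing ack test.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encodings to cover; "UTF-16" and "UTF-32" write and expect a BOM.
my @encodings = qw(UTF-8 UTF-16 UTF-16LE UTF-16BE UTF-32 UTF-32LE UTF-32BE);

# Hypothetical helper: encode the text, decode it again, and report
# whether the decoded text still matches \p{Han}.
sub roundtrip_matches {
    my ($enc, $text) = @_;
    my $octets  = encode($enc, $text);
    my $decoded = decode($enc, $octets);
    return ($decoded eq $text && $decoded =~ /\p{Han}/) ? 1 : 0;
}

my $sample = "hello \x{4E16}\x{754C}\n";
my $passed = 0;
for my $enc (@encodings) {
    my $ok = roundtrip_matches($enc, $sample);
    $passed += $ok;
    printf "%-8s %s\n", $enc, $ok ? 'ok' : 'FAILED';
}
print "$passed of ", scalar(@encodings), " encodings passed\n";
```

A real suite would need this kind of loop around every existing test file, which is where the multiplication of effort comes from.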

n1vux, Aug 08 '19 15:08

--passthru will avoid the fatal error in 5.30+ by skipping the sysread pre-check optimization, but at a cost: all lines print, and highlights may get Unicode-mangled.

n1vux, Dec 03 '21 18:12