ack3

Add proper Unicode support

Open hoelzro opened this issue 13 years ago • 3 comments

Ack should support Unicode properly on perls that can handle it.

  • Needs to run on perl 5.14 or later (I think this is the minimum version, verify)
    • This isn't entirely true; we can handle encoding/decoding and normalization with 5.8. We can't, however, have Unicode-aware regular expressions, nor can we use the shiny new stuff that is included in 5.14. More details later.
  • -g patterns should work properly with Unicode filenames or regexes containing Unicode characters (or stuff from charnames)
    • This should hold even when the regular expression and the source use different normalization forms (composed vs. decomposed).
  • The same rules apply to the file searching patterns.
  • We need to make sure we properly encode/decode files (this could be tough)
    • How do we determine files' encodings? Do we assume UTF-8? Do we provide an option for use in ackrc?
  • The output stream should probably be UTF-8 encoded.
  • Additional options for collation level should probably be provided.
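
The composition/decomposition point can be illustrated with Unicode::Normalize, which is a core module. This is only a minimal sketch, assuming a literal (non-regex) pattern; `matches_normalized` is a hypothetical helper and not something ack provides.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# The same visible word in two forms:
# precomposed U+00E9 vs. "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $composed   = "caf\x{E9}";
my $decomposed = "cafe\x{301}";

# Hypothetical helper: bring both sides to NFC before matching.
# The pattern is treated as a literal string, not as a regex.
sub matches_normalized {
    my ($pattern, $text) = @_;
    my $re = quotemeta(NFC($pattern));
    return NFC($text) =~ /$re/ ? 1 : 0;
}

# Without normalization the literal match fails; with it, it succeeds.
my $raw_match  = ($decomposed =~ /\Q$composed\E/) ? 1 : 0;
my $norm_match = matches_normalized($composed, $decomposed);
```

Normalizing a real regular expression is harder than normalizing a literal, since character classes and quantifiers can split a base character from its combining marks, so this only shows the basic idea.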

hoelzro avatar Jun 20 '12 18:06 hoelzro

Details necessary as to what that would entail.

petdance avatar Jun 20 '12 18:06 petdance

"We can't, however, have Unicode-aware regular expressions"

I think regexps work fine with Unicode in 5.8 (probably not all of them, but most). "The Unicode bug" can easily be worked around with utf8::upgrade.
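
A small sketch of that workaround, assuming an ASCII platform and no `use locale`, `use utf8`, or `unicode_strings` feature in scope (any of those changes the behaviour). The variable names are illustrative only.

```perl
use strict;
use warnings;

# One character, U+00E9 (e with acute), stored in Perl's native 8-bit form.
my $native = "\xE9";
my $before = ($native =~ /\w/) ? 1 : 0;    # 0: byte semantics, not a word char

# utf8::upgrade changes only the internal representation, not the value.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $after = ($upgraded =~ /\w/) ? 1 : 0;   # 1: Unicode semantics now apply

# The two strings still compare equal as character strings.
my $same = ($native eq $upgraded) ? 1 : 0;

print "before=$before after=$after same=$same\n";
```

The upgrade has to be applied to every string that is matched, which is the main cost of relying on it.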

"We need to make sure we properly encode/decode files (this could be tough)"

  1. First of all, you need to decode @ARGV. The user's locale can probably help there ( I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()) ).
  2. Then you can probably assume that the file encoding matches the locale encoding (that's actually how ack works now: the @ARGV encoding matches the locale, which matches the file encoding), and provide an option to override this default. Also, FYI, UTF-8 can be detected with very few false positives (if the file contains more than ~5 non-ASCII chars).
  3. Filename encoding. It's very tricky (but possible) to create an application which works with Unicode but ignores filename encoding. Sometimes filenames don't have any encoding at all (garbage bytes; this actually happens more often than one might expect), and sometimes the filename encoding does not match the locale.
  4. AFAIK, under Linux most people use UTF-8 locales. Under *BSD, single-byte encoding locales are still the default after an initial install (and UTF-8 support is incomplete).
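
Points 1 and 2 could be sketched roughly as follows. This assumes the locale codeset is a name Encode recognizes and falls back to UTF-8 otherwise; `locale_encoding` and `looks_like_utf8` are hypothetical helper names, not existing ack functions.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: Encode object for the locale's codeset, falling back
# to UTF-8 when the codeset is unavailable or unknown to Encode.
sub locale_encoding {
    my $name = eval { langinfo(CODESET) } || 'UTF-8';
    return find_encoding($name) || find_encoding('UTF-8');
}

# Hypothetical helper: true only if the bytes contain non-ASCII data that
# decodes cleanly as strict UTF-8. Pure ASCII gives no evidence either way.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[^\x00-\x7F]/;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Decode the command-line arguments with the locale's encoding.
# Each argument is copied first so @ARGV itself is left untouched.
my $enc  = locale_encoding();
my @args = map { my $raw = $_; $enc->decode($raw) } @ARGV;
```

The sketch does not implement the "~5 non-ASCII chars" threshold mentioned above, and it does nothing about point 3: filenames that are not valid in any encoding would still need to be handled as raw bytes.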

"The output stream should probably be UTF-8 encoded."

Or encoded with I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()). (Note that I18N::Langinfo is a core module.)
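
Either choice comes down to putting an encoding layer on the output handle. A minimal sketch, using an in-memory handle in place of STDOUT so the resulting bytes can be inspected; `set_output_encoding` is a hypothetical helper, and the encoding name passed to it could equally come from langinfo(CODESET).

```perl
use strict;
use warnings;

# Hypothetical helper: push an encoding layer onto an output handle.
sub set_output_encoding {
    my ($fh, $name) = @_;
    binmode($fh, ":encoding($name)")
        or die "cannot set encoding layer: $!";
    return;
}

# An in-memory handle stands in for STDOUT here.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
set_output_encoding($fh, 'UTF-8');

# Print a character string; the layer turns it into bytes on the way out.
print {$fh} "caf\x{E9}\n";
close($fh) or die "close failed: $!";

# $buffer now holds the encoded bytes: "caf", 0xC3, 0xA9, newline.
```

For the real output stream the call would be `set_output_encoding(\*STDOUT, $name)`.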

vsespb avatar Sep 03 '13 15:09 vsespb

Depressing ... on a lark, i tried running make test with Unicode everywhere (PERL_UNICODE=SAD inserted into t/runtest.pl) to see how bad it would be. Worse than i'd hoped, not as bad as i feared.

Result: FAIL
Failed 10/95 test programs. 30/980 subtests failed.

I count

  • 18x # 'Malformed UTF-8 character (fatal) at ....' on STDERR, at Basic.pm:176 (sub firstliney, $buffer =~ s/[\r\n].*//s;). i suspect this means the test file has ISO or codepage accents rather than UTF-8, which would be a problem for making this workaround universal.
  • 9x t/filetypes.t Failed test 'xxxxx.pod can be yyyy', getting too few choices; unclear why Unicode interferes with that. But interfering with ack type inference is interfering with core functionality, if it isn't a test-framework glitch.
  • not sure which 3 failed subtests i haven't categorized :-)

Inserting the -CSAD equivalent workaround on the hashbang in blib/script/ack is not the way to force it ...

 'Too late for "-CSAD" option at /home/wdr/100G/perl/github/ack2/ack2/blib/script/ack line 1.'

yup, perlrun warns that it must be on the command line as well, not just in the hashbang.
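
A rough in-script approximation of -CSAD that avoids the hashbang problem, assuming UTF-8 is the encoding wanted everywhere. It is not an exact equivalent: `use open` is lexically scoped and `:encoding(UTF-8)` is stricter than the `:utf8` layer that -C applies. `decode_args` is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly -CS and -CD: UTF-8 layers on STDIN/STDOUT/STDERR and on handles
# opened without explicit layers within this lexical scope.
use open qw(:std :encoding(UTF-8));

# Roughly -CA: decode command-line arguments as UTF-8.
# Each argument is copied first so the caller's list is left untouched.
sub decode_args {
    return map { my $raw = $_; decode('UTF-8', $raw) } @_;
}

my @args = decode_args(@ARGV);
```

This does nothing for input files that are not actually UTF-8, such as the ones suspected of triggering the 'Malformed UTF-8 character' failures; those would still need per-file handling.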

n1vux avatar Jul 31 '15 23:07 n1vux