Add proper Unicode support
Ack should support Unicode properly on perls that can handle it.
- Needs to run on perl 5.14 or later (I think this is the minimum version, verify)
- This isn't entirely true; we can handle encoding/decoding and normalization with 5.8. We can't, however, have Unicode-aware regular expressions, nor can we use the shiny new stuff that is included in 5.14. More details later.
- -g patterns should work properly with Unicode filenames or regexes containing Unicode characters (or stuff from charnames)
  - This applies when the composition/decomposition of the regular expression and the source vary.
- The same rules apply to the file searching patterns.
- We need to make sure we properly encode/decode files (this could be tough)
- How do we determine files' encodings? Do we assume UTF-8? Do we provide an option for use in ackrc?
- The output stream should probably be UTF-8 encoded.
- Additional options for collation level should probably be provided.
Details necessary as to what that would entail.
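To make the composition/decomposition point concrete, here is a minimal sketch that normalizes both the pattern and the candidate text to NFC before matching. It uses Unicode::Normalize, which ships with perl. The helper name `nfc_match` is hypothetical and not part of ack, and normalizing a raw pattern string this way is only an assumption about how the feature might work.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper (not part of ack): normalize both sides to NFC before
# matching, so precomposed and decomposed spellings of the same text agree.
sub nfc_match {
    my ($pattern, $text) = @_;
    my $re = NFC($pattern);
    return NFC($text) =~ /$re/ ? 1 : 0;
}

# "e with acute" as one code point (U+00E9) in the pattern, and as
# "e" followed by a combining acute accent (U+0301) in the filename.
my $pattern  = "caf\x{E9}";
my $filename = "cafe\x{301}.txt";

# The raw comparison misses because the code point sequences differ;
# the normalized comparison is expected to match.
my $raw_hit = $filename =~ /$pattern/ ? 1 : 0;
my $nfc_hit = nfc_match($pattern, $filename);
```

Whether NFC or NFD is the better target form is an open design choice; the sketch only shows that both sides need to agree on one.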
We can't, however, have Unicode-aware regular expressions
I think regexps work fine with Unicode in 5.8 (probably not all of them, but most). "The Unicode bug" can easily be worked around with utf8::upgrade.
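A minimal sketch of the utf8::upgrade workaround mentioned above, assuming no `use locale` and no `unicode_strings` feature in scope. The helper name `has_word_char` is hypothetical. The string holds a single character, U+00E9, stored in perl's native 8-bit form; upgrading it changes only the internal representation, not the characters.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a string contains a \w character,
# optionally upgrading a copy to perl's internal UTF-8 form first.
sub has_word_char {
    my ($str, $upgrade) = @_;
    my $copy = $str;                 # leave the caller's string untouched
    utf8::upgrade($copy) if $upgrade;
    return $copy =~ /\w/ ? 1 : 0;
}

my $e_acute = "\xE9";                # one character, native 8-bit storage

# Without the upgrade, the regex engine may apply ASCII-only rules to this
# string (the "Unicode bug"); with the upgrade it applies Unicode rules.
my $before = has_word_char($e_acute, 0);
my $after  = has_word_char($e_acute, 1);
```

On a perl that exhibits the bug, `$before` and `$after` differ even though the string's characters are identical.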
We need to make sure we properly encode/decode files (this could be tough)
- First of all, you need to decode @ARGV. The user's locale can probably help there ( I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()) )
- Then you can probably assume that the file encoding matches the locale encoding (that's actually how ack works now: the @ARGV encoding matches the locale and matches the file encoding), and provide an option to override this default. Also, FYI, UTF-8 can be detected with very few false positives (if the file contains more than ~5 non-ASCII chars).
- Filename encoding. It's very tricky (but possible) to create an application that works with Unicode but ignores filename encoding. Sometimes a filename doesn't have any encoding at all (garbage bytes; this actually happens more often than one would expect), and sometimes the filename encoding does not match the locale.
- AFAIK, under Linux most users have UTF-8 locales. Under *BSD, single-byte encoding locales are still the default after initial install (and UTF-8 support is not complete).
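The steps above could be sketched roughly as follows. The helper names (`locale_codeset`, `decode_args`, `looks_like_utf8`) are hypothetical, the fallback to UTF-8 when the locale gives no answer is an assumption, and the threshold of 5 non-ASCII characters is simply the rough figure quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the locale for its codeset; falling back to UTF-8 is an assumption.
sub locale_codeset {
    my $codeset = eval { langinfo(CODESET()) };
    return (defined $codeset && length $codeset) ? $codeset : 'UTF-8';
}

# Decode a list of byte strings (for example @ARGV) from the given codeset.
# If Encode does not know the codeset, the arguments come back unchanged.
sub decode_args {
    my ($codeset, @args) = @_;
    my $enc = find_encoding($codeset) or return @args;
    return map { my $copy = $_; $enc->decode($copy) } @args;
}

# Heuristic: the bytes decode as strict UTF-8 and contain at least
# $min_non_ascii non-ASCII characters (default 5).
sub looks_like_utf8 {
    my ($bytes, $min_non_ascii) = @_;
    $min_non_ascii = 5 unless defined $min_non_ascii;
    my $copy = $bytes;               # decode may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return 0 unless defined $text;   # malformed UTF-8
    my $count = () = $text =~ /[^\x00-\x7F]/g;
    return $count >= $min_non_ascii ? 1 : 0;
}

# Usage sketch:
#   my @args = decode_args(locale_codeset(), @ARGV);
```

Filenames made of garbage bytes would still need separate handling, since decoding them can fail or produce replacement characters.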
The output stream should probably be UTF-8 encoded.
Or encoded according to I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()). (Note that I18N::Langinfo is a core module.)
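As a rough sketch of that suggestion, the output handle can be given an encoding layer chosen from the locale. The helper name `set_output_encoding` is hypothetical, and this assumes the codeset name reported by the locale is one that perl's `:encoding()` layer recognizes.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: attach an encoding layer to an output handle so that
# character strings printed to it are written in the given codeset.
sub set_output_encoding {
    my ($fh, $codeset) = @_;
    binmode($fh, ":encoding($codeset)")
        or die "Cannot use encoding '$codeset': $!";
    return $fh;
}

# Usage sketch (not executed here):
#   set_output_encoding(\*STDOUT, langinfo(CODESET()));
#   print "caf\x{E9}\n";
```

An ackrc option to override the codeset could feed the same helper.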
Depressing ... on a lark, I tried running make test with Unicode everywhere (PERL_UNICODE=SAD inserted into t/runtest.pl) to see how bad it would be. Worse than I'd hoped, not as bad as I feared.
Result: FAIL
Failed 10/95 test programs. 30/980 subtests failed.
I count
- 18x # 'Malformed UTF-8 character (fatal) at ....' on STDERR, at Basic.pm:176 in sub firstliney: $buffer =~ s/[\r\n].*//s; I suspect this means the test file has ISO or codepage accents rather than UTF-8? Which would be a problem for making this workaround universal?
- 9x t/filetypes.t: Failed test 'xxxxx.pod can be yyyy', getting too few choices; unclear why Unicode interferes with that. But interfering with ack's type inference is interfering with core functionality, if it isn't a test-framework glitch.
- Not sure about the remaining 3 failed subtests, which I haven't categorized :-)
Inserting the -CSAD equivalent workaround on the hashbang in blib/scripts/ack is not the way to force it ...
'Too late for "-CSAD" option at /home/wdr/100G/perl/github/ack2/ack2/blib/script/ack line 1.'
Yup, perlrun warns that it must be on the command line, not just in the hashbang.
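Since the switch cannot be set from the hashbang alone, one possible alternative is to approximate it from inside the script. This is only a sketch of a rough equivalent, not an exact reproduction of -CSAD: the `open` pragma stands in for the S and D flags, and the hypothetical helper `decode_argv_utf8` stands in for the A flag.

```perl
use strict;
use warnings;

# Roughly the S and D parts: make STDIN/STDOUT/STDERR and the default
# layers for open() in this lexical scope use UTF-8.
use open qw(:std :encoding(UTF-8));

# Roughly the A part: hypothetical helper that decodes copies of the
# arguments from UTF-8. utf8::decode leaves a string unchanged when the
# bytes are not valid UTF-8.
sub decode_argv_utf8 {
    return map { my $copy = $_; utf8::decode($copy); $copy } @_;
}

# Usage sketch:
#   @ARGV = decode_argv_utf8(@ARGV);
```

Unlike the command-line switch, this still forces UTF-8 regardless of locale, so the malformed-character failures listed above would likely remain for non-UTF-8 test files.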