perl5
RFE: use heuristic for utf8 usage w/-Mutf8 in PERL5OPT
From [email protected]
Created by [email protected]
For some time I had an odd output in one of my programs where I tried to use a right-pointing double angle quotation mark U+00BB (»). It always came out as "Â»". I had "use utf8;" in my source, even had use utf8::all; in some, but most of all, thought I was safe with "-Mutf8 -CSA" in PERL5OPT.
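A likely mechanism for this kind of garbling can be sketched in Python, used here only as a neutral way to look at the bytes. The assumption is that the two UTF-8 bytes of U+00BB are read as two separate Latin-1 characters and that a UTF-8 output layer then encodes each of them again. The function name `double_encode` is hypothetical, and this is a guess at the cause, not a trace of what perl does internally.

```python
def double_encode(text: str) -> bytes:
    """Encode text as UTF-8, misread the bytes as Latin-1, then encode again."""
    source_bytes = text.encode("utf-8")        # bytes as stored in the source file
    misread = source_bytes.decode("latin-1")   # each byte becomes one character
    return misread.encode("utf-8")             # the output layer encodes a second time


if __name__ == "__main__":
    garbled = double_encode("\u00bb")
    print(garbled.hex())            # c382c2bb
    print(garbled.decode("utf-8"))  # Â»
```

Under that assumption the two input bytes c2 bb leave the program as four bytes, which a UTF-8 terminal shows as an extra "Â" in front of the intended character.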
Once I'd finished development on an older module, I simply used it. If I ran the module as a program under the debugger, it seemed to work. The problem was that I simply wanted perl to assume modern sources should be treated as utf8, or at worst to output the same bytes as on input. bash does this:
    a="»"
    printf "%s\n" "$a"
    »
    printf "%s\n" "$a"|hexdump -C
    00000000 c2 bb 0a
---
C does this:
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        char arr[3]="»";
        printf("%s\n", arr);
    }

    gcc ar.c -o ar
    ar
    »
I can't think of any language that forces 0x80-0xff into a different encoding in source or input than it outputs.
*Ideally*, perl wouldn't either. However, some would complain of compat probs (though didn't seem to cause end of the world for bash or C doing it that I'm aware of).
BUT, at the very least... a compromise heuristic could be used. A first level heuristic would be:
1) if 0xc2 or 0xc3 followed by another hex byte in the range 0x80-0xff, occurs in source, presume it is utf8 encoded.
For some though, that would still let too much incompat slip through.
To that I say, add:
2) if the ENV var PERL5OPT has -Mutf8 in it AND heuristic 1 matches, then assume the source is utf8. It might not be 100% compatible, BUT it lets a local user set a presumption for their system. If they run into a module that doesn't work, they can work around it. Alternatively, have perl access a site config file (I think it can be configured to use one in /etc/?) where they can specify the flag.
If more safety were wanted, as an add-on step to 1 or 2:
3) put out a one-time warning, on a per-module basis, naming the first byte combination that triggers the utf8 presumption. That way, the user could either silence the warning or simply add 'use utf8' to the beginning of that module (the latter being more logical).
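The proposed steps can be sketched as follows, in Python purely for illustration. All function names, the `warned` set, and the warning text are hypothetical; nothing here reflects how perl actually reads source files. The byte test follows the proposal as written (0xc2 or 0xc3 followed by 0x80-0xff), which is looser than well-formed UTF-8, where continuation bytes only span 0x80-0xbf.

```python
import re
import sys
from typing import Optional

# Heuristic 1 as stated in the proposal: 0xC2 or 0xC3, then any byte 0x80-0xFF.
LEAD_THEN_HIGH = re.compile(rb"[\xc2\xc3][\x80-\xff]")


def first_utf8_hint(source: bytes) -> Optional[int]:
    """Return the offset of the first byte pair matching heuristic 1, or None."""
    match = LEAD_THEN_HIGH.search(source)
    return match.start() if match else None


def perl5opt_requests_utf8(perl5opt: str) -> bool:
    """Heuristic 2, first half: PERL5OPT contains the -Mutf8 switch."""
    return "-Mutf8" in perl5opt.split()


def presume_utf8(source: bytes, perl5opt: str, module: str, warned: set) -> bool:
    """Combine heuristics 1 and 2, warning once per module (step 3)."""
    offset = first_utf8_hint(source)
    if offset is None or not perl5opt_requests_utf8(perl5opt):
        return False
    if module not in warned:
        warned.add(module)
        pair = source[offset:offset + 2].hex()
        print(f"{module}: bytes {pair} at offset {offset} treated as UTF-8",
              file=sys.stderr)
    return True
```

A call such as `presume_utf8(b'print "\xc2\xbb";', "-Mutf8 -CSA", "Foo.pm", set())` returns `True` and emits one warning for `Foo.pm`; without `-Mutf8` in the option string it returns `False` and stays silent.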
-----------------
Tangential, but related: Additionally, if a config file is used, it should be possible to specify stdin/out/err as defaulting to the locale, the assumption being that streamed I/O is not how one would normally access binary data. The idea is to have perl be [mostly] binary clean in regards to streamed input & output. (I realize some want to flag errors on invalid utf8 -- not my first choice, but I don't see a problem with that in streamed i/o, as the assumption is one wouldn't use a variable length encoding for storing binary data.)
This might assist in putting the infamous perl utf8 bug to rest (at least for the most part). It also introduces the idea of trying to give or do what the user wants based on increasing levels of evidence. Admittedly an imperfect science, but better than using rigid standards when it comes to humans. Perl should "just be smarter".
This isn't version related as it happens under perl 5.24.0 as well as 5.16.3.
Perl Info
Flags:
category=core
severity=wishlist
Site configuration information for perl 5.16.3:
Configured by law at Wed Jan 22 12:58:58 PST 2014.
Summary of my perl5 (revision 5 version 16 subversion 3) configuration:
Platform:
osname=linux, osvers=3.12.0-isht-van, archname=x86_64-linux-thread-multi-ld
uname='linux ishtar 3.12.0-isht-van #1 smp preempt wed nov 13 16:50:51 pst 2013 x86_64 x86_64 x86_64 gnulinux '
config_args=''
hint=previous, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=define
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-g -O2',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
ccversion='', gccversion='4.8.1 20130909 [gcc-4_8-branch revision 202388]', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8
alignbytes=16, prototype=define
Linker and Libraries:
ld='gcc', ldflags ='-g -fstack-protector -fPIC'
libpth=/usr/lib64 /lib64
libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.18.so, so=so, useshrplib=true, libperl=libperl-5.16.3.so
gnulibc_version='2.18'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld/CORE'
cccdlflags='-fPIC', lddlflags='-shared -g -O2 -fstack-protector -fPIC'
Locally applied patches:
@INC for perl 5.16.3:
/home/law/bin/lib
/home/perl/perl-5.16.3/lib/site/x86_64-linux-thread-multi-ld
/home/perl/perl-5.16.3/lib/site
/home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld
/home/perl/perl-5.16.3/lib
.
Environment for perl 5.16.3:
HOME=/home/law
LANG (unset)
LANGUAGE (unset)
LC_COLLATE=C
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=C
LC_MONETARY=C
LC_NUMERIC=C
LC_TIME=C
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/perl/perl-5.24/usr/bin:.:/sbin:/home/law/bin/lib:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/opt/kde3/bin:/usr/sbin:/etc/local/func_lib:/home/law/lib
PERL5OPT=-Mutf8 -CSA -I/home/law/bin/lib
PERL_BADLANG (unset)
SHELL=/bin/bash
From @grinnz
A couple of things:
1. "Output the same bytes as on input." Nothing in Perl prevents this from occurring, but it's impossible to perform character-aware operations (like matching \w against Unicode word characters) without knowing what encoding to decode the input from.
2. "use utf8;" only affects the source code itself. It's very different to talk about Perl's treatment of the bytes in the source code, and Perl's treatment of input and output bytes. Other operations are required to translate UTF-8 encoding at STDIN/STDOUT/STDERR, ARGV, and opened filehandle boundaries, among other things. These three things are covered by -CSAD. See https://metacpan.org/pod/perlrun#-C-[number/list]
3. I disagree with the feasibility of any of the presented heuristics. It's 100% possible for a single-byte encoded file to look like UTF-8.
4. Using the locale to set default utf8 layers was a failed experiment in (I believe) Perl 5.8.0. You can enable this behavior for yourself with -CSADL (or adding L to your other -C switch arguments, see above link).
5. A potential way forward to at least default to the behavior of 'use utf8;' (decoding source code as UTF-8) was previously discussed in https://www.nntp.perl.org/group/perl.perl5.porters/2017/10/msg246838.html - I don't think there's any reasonable path to defaulting other handles to set utf8 layers.
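The ambiguity raised in point 3 can be seen with the very bytes from the original report. This is a small Python sketch (the helper name `decodings` and the choice of candidate encodings are assumptions for illustration): the pair c2 bb decodes without error as UTF-8 and as two common single-byte encodings, so the bytes alone do not say which one the author meant.

```python
def decodings(raw: bytes) -> dict:
    """Decode the same bytes under several encodings, skipping any that fail."""
    results = {}
    for name in ("utf-8", "latin-1", "cp1252"):
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


if __name__ == "__main__":
    for name, text in decodings(b"\xc2\xbb").items():
        print(f"{name}: {text!r}")
```

All three decodings succeed for c2 bb; only a lone high byte such as bb on its own is rejected by the UTF-8 decoder, which is why a byte-pattern heuristic can raise confidence but cannot settle the question.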
The RT System itself - Status changed from 'new' to 'open'
I believe that #20976 makes this unnecessary