Failing regex test that dies with "panic: pp_match start/end pointers"
From @avar
Created by @avar
It's possible to make Perl die in pp_match by manually setting the utf8 flag on a specially crafted string. Initially this happened in a program that wasn't manually setting the flag, but I've distilled it down to this simple report.
See this message on perl5-porters for the full report: http://www.nntp.perl.org/group/perl.perl5.porters/2010/02/msg156829.html
This is just an RT copy, as requested.
Perl Info
Flags:
category=core
severity=medium
This perlbug was built using Perl 5.11.4 - Tue Feb 16 12:45:44 GMT 2010
It is being executed now by Perl 5.10.1 - Sun Feb 7 13:06:31 UTC 2010.
Site configuration information for perl 5.10.1:
Configured by Debian Project at Sun Feb 7 13:06:31 UTC 2010.
Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
Platform:
osname=linux, osvers=2.6.32-trunk-amd64, archname=x86_64-linux-gnu-thread-multi
uname='linux madeleine 2.6.32-trunk-amd64 #1 smp sun jan 10 22:40:40 utc 2010 x86_64 gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -g',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='4.4.3', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.10.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
gnulibc_version='2.10.2'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'
Locally applied patches:
@INC for perl 5.10.1:
/etc/perl
/usr/local/lib/perl/5.10.1
/usr/local/share/perl/5.10.1
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.10
/usr/share/perl/5.10
/usr/local/lib/site_perl
.
Environment for perl 5.10.1:
HOME=/home/avar
LANG=en_US.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/home/avar/g/misc-scripts:/home/avar/g/misc-scripts
PERLDOC=-MPod::Text::Ansi
PERL_BADLANG (unset)
SHELL=/bin/bash
From [email protected]
What's the proposed remedy? The behavior we have now is the behavior I would expect.
Munging the internal data of SVf_UTF8 strings isn't an easy mistake to make -- if I recall correctly, the bitwise operators pose the only silent threat. Any tool that lets you set the SVf_UTF8 flag directly -- such as Encode::_utf8_on(), used in the test case -- has big fat red warnings all over its docs: "Do not use unless you know that the STRING is well-formed UTF-8."
I don't know about this specific case, but in general, always being prepared to recover gracefully from bad UTF-8 data at any moment is expensive. I don't think we should sacrifice performance or code clarity for the sake of, e.g. perpetually preparing to swap in the Unicode replacement character. But I don't think we ought to silently propagate bad data via bad match pointers, either. In my opinion, dying with an informative error message is the best behavior.
The time to validate UTF-8 data is on input.
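A minimal sketch of that input-side validation, using Encode's croak-on-malformed fallback (the sample bytes are the truncated sequence from the test case):
use Encode ();
my $octets = "\xe2\x90";   # truncated three-byte UTF-8 sequence
# With FB_CROAK, decode() dies on the first malformed sequence instead
# of letting mislabeled data travel further into the program.
my $text = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK) };
die "input is not valid UTF-8: $@" unless defined $text;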
The RT System itself - Status changed from 'new' to 'open'
From @avar
On Tue, Feb 23, 2010 at 05:19, Marvin Humphrey via RT <perlbug-followup@perl.org> wrote:
What's the proposed remedy? The behavior we have now is the behavior I would expect.
Munging the internal data of SVf_UTF8 strings isn't an easy mistake to make -- if I recall correctly, the bitwise operators pose the only silent threat. Any tool that lets you set the SVf_UTF8 flag directly -- such as Encode::_utf8_on(), used in the test case -- has big fat red warnings all over its docs: "Do not use unless you know that the STRING is well-formed UTF-8."
I don't know about this specific case, but in general, always being prepared to recover gracefully from bad UTF-8 data at any moment is expensive. I don't think we should sacrifice performance or code clarity for the sake of, e.g. perpetually preparing to swap in the Unicode replacement character. But I don't think we ought to silently propagate bad data via bad match pointers, either. In my opinion, dying with an informative error message is the best behavior.
The core issue here seems to be that when this data gets passed into the core, the regex engine ends up with a different idea of the string's length than the rest of the core has. This can be seen in the test case by breaking in pp_match where it dies.
As noted in the bug, this initially happened in a program that wasn't doing anything particularly evil. I spotted this while parsing a malformed line of the #perl6 IRC log with POE::Component::IRC::Common, which does this:
sub irc_to_utf8 {
    my ($line) = @_;
    my $utf8 = guess_encoding($line, 'utf8');
    return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}
Unfortunately I haven't been able to reproduce that again...
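For reference, guess_encoding() here comes from Encode::Guess; it returns an Encode::Encoding object when the guess succeeds and an error string otherwise, which is what the ref test above distinguishes. A small sketch:
use Encode::Guess;
# Returns an Encode::Encoding object if the bytes parse unambiguously as
# one of the suspect encodings, or a diagnostic string if they don't.
my $guess = guess_encoding("\xe2\x82\xac", 'utf8');   # valid UTF-8 for U+20AC
print ref $guess ? 'guessed: ' . $guess->name . "\n"
                 : "no guess: $guess\n";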
From [email protected]
On Tue, Feb 23, 2010 at 05:31:47AM +0000, Ævar Arnfjörð Bjarmason wrote:
The core issue here seems to be that when this data gets passed into the core, the regex engine ends up with a different idea of the string's length than the rest of the core has.
But that's because the scalar is in an invalid state. It's not something either the core or the regex engine is equipped to recover from, and my contention is that equipping either the core or the regex engine so that they could successfully patch up that scalar and return it to a valid state under all circumstances would be too expensive. We can't afford to promote "SVf_UTF8 flag set, but invalid byte sequence" to an allowed state.
As noted in the bug, this initially happened in a program that wasn't doing anything particularly evil. I spotted this while parsing a malformed line of the #perl6 IRC log with POE::Component::IRC::Common, which does this:
sub irc_to_utf8 {
    my ($line) = @_;
    my $utf8 = guess_encoding($line, 'utf8');
    return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}
Unfortunately I haven't been able to reproduce that again...
I submit that that's where the fixable bug lies, and that efforts should go into isolating the problem there. A start might be to submit random byte sequences to decode() with both of those encoding specifications to see if you can get it to choke.
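A rough sketch of such a fuzzing loop (iteration count and lengths are arbitrary choices; utf8::valid() serves as the consistency probe):
use Encode qw(decode);
use Encode::Guess;
# Push random byte strings through the same code path as irc_to_utf8()
# and flag any result that ends up in an inconsistent state.
for (1 .. 100_000) {
    my $bytes = join '', map { chr int rand 256 } 1 .. 1 + int rand 32;
    my $guess = guess_encoding($bytes, 'utf8');
    my $text  = ref $guess ? decode('utf8', $bytes)
                           : decode('cp1252', $bytes);
    die 'inconsistent scalar from bytes: ' . unpack('H*', $bytes) . "\n"
        unless utf8::valid($text);
}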
There's room for one possible improvement within Perl itself: the "panic" error message doesn't tell you what the scalar contained, which would be useful in tracking down what went wrong. A Devel::Peek-style escaped string dump would be nice to have.
Otherwise, I vote to close this bug as a "won't fix".
Marvin Humphrey
From @avar
On Tue, Feb 23, 2010 at 20:53, Marvin Humphrey <marvin@rectangular.com> wrote:
On Tue, Feb 23, 2010 at 05:31:47AM +0000, Ævar Arnfjörð Bjarmason wrote:
The core issue here seems to be that when this data gets passed into the core, the regex engine ends up with a different idea of the string's length than the rest of the core has.
But that's because the scalar is in an invalid state. It's not something either the core or the regex engine is equipped to recover from, and my contention is that equipping either the core or the regex engine so that they could successfully patch up that scalar and return it to a valid state under all circumstances would be too expensive. We can't afford to promote "SVf_UTF8 flag set, but invalid byte sequence" to an allowed state.
I'm not familiar with the UTF-8 bits of the regex engine, but why is its idea of the length of an invalid UTF-8 byte sequence marked as UTF-8 different from that of the core's scalar functions? That suggests something odd to me.
As noted in the bug, this initially happened in a program that wasn't doing anything particularly evil. I spotted this while parsing a malformed line of the #perl6 IRC log with POE::Component::IRC::Common, which does this:
sub irc_to_utf8 {
    my ($line) = @_;
    my $utf8 = guess_encoding($line, 'utf8');
    return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}
Unfortunately I haven't been able to reproduce that again...
I submit that that's where the fixable bug lies, and that efforts should go into isolating the problem there. A start might be to submit random byte sequences to decode() with both of those encoding specifications to see if you can get it to choke.
There's room for one possible improvement within Perl itself: the "panic" error message doesn't tell you what the scalar contained, which would be useful in tracking down what went wrong. A Devel::Peek-style escaped string dump would be nice to have.
Otherwise, I vote to close this bug as a "won't fix".
Here's the short test case I was looking for, one that doesn't do anything too evil:
$ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in = join q[], <>; Dump($in); () = $in =~ /(.*)/;'
SV = PV(0x9672810) at 0x9694668
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
  CUR = 2
  LEN = 4
panic: pp_match start/end pointers at -e line 1, <> line 1.
The problem with saying that you shouldn't turn on the UTF-8 flag on invalid data is that people are doing that all the time now on data that may very well be garbage. A lot of modules / programs are using the -C switch or the equivalent $PERL_UNICODE or open.pm options. As it is, you can't parse /dev/urandom or IRC log files that may be 99.9% valid UTF-8 without the remaining 0.1% eventually killing perl if you use these widely used and recommended modules.
I don't think it's acceptable that perl die this way.
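For what it's worth, the broken state is detectable from pure Perl before any match is attempted; utf8::valid(), documented in the utf8 pragma, returns false for exactly this flag-on-but-malformed combination (a minimal sketch):
use Encode ();
my $s = "\xe2\x90";      # truncated three-byte sequence
Encode::_utf8_on($s);    # flag set, bytes never validated
warn "scalar is in an inconsistent UTF-8 state\n" unless utf8::valid($s);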
From [email protected]
Quoth avarab@gmail.com (Ævar Arnfjörð Bjarmason):
Here's the short test case I was looking for, one that doesn't do anything too evil:
$ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in = join q[], <>; Dump($in); () = $in =~ /(.*)/;'
SV = PV(0x9672810) at 0x9694668
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
  CUR = 2
  LEN = 4
panic: pp_match start/end pointers at -e line 1, <> line 1.
The problem with saying that you shouldn't turn on the UTF-8 flag on invalid data is that people are doing that all the time now on data that may very well be garbage. A lot of modules / programs are using the -C switch or the equivalent $PERL_UNICODE or open.pm options. As it is, you can't parse /dev/urandom or IRC log files that may be 99.9% valid UTF-8 without the remaining 0.1% eventually killing perl if you use these widely used and recommended modules.
As currently implemented, pushing a :utf8 layer (as opposed to an :encoding(utf8) layer) onto an input filehandle is *always* a bug, for exactly this reason. IMHO :utf8 should do at least minimal correctness checking, and throw a catchable exception on failure (and no, I don't care if it's slower). :encoding(utf8) is still there for those who need more control.
Ben
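A sketch of the distinction Ben is drawing, on an input handle (the filename is a placeholder):
# Unchecked: bytes are flagged as UTF-8 as-is; malformed data gets in.
open my $raw, '<:utf8', 'irc.log' or die $!;
# Checked: the :encoding layer validates, substituting/warning on bad bytes.
open my $checked, '<:encoding(UTF-8)', 'irc.log' or die $!;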
From [email protected]
On Wed, Feb 24, 2010 at 12:27:38AM +0000, Ævar Arnfjörð Bjarmason wrote:
I'm not familiar with the UTF-8 bits of the regex engine, but why is its idea of the length of an invalid UTF-8 byte sequence marked as UTF-8 different from that of the core's scalar functions? That suggests something odd to me.
I don't know the regex engine well myself, but I have written a certain amount of low-level UTF-8 manipulation C code for other projects. The typical way to repair invalid UTF-8 byte sequences is to swap in the replacement character -- e.g. that's what Encode::decode() does by default. That's not going to happen on the fly in the middle of a compiled regex operation, or a substring, or what have you, because it might require reallocation.
However, if you don't do that and you blithely keep on going, it's easy to get stuff like memory read or write overruns. Or rather, you could write the code to be safe, but it would involve one of two fixes. 1) Prescan and if necessary repair the scalar before any dangerous operation. 2) Rewrite your low-level algorithms to perform constant bounds checking, effectively pushing down validation to the byte level. Either of those approaches is unacceptably costly.
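For comparison, the substitution repair is what Encode::decode() already does at the boundary under its default CHECK value (a sketch):
use Encode qw(decode);
# The default fallback swaps U+FFFD in for each malformed sequence,
# resizing the string as needed -- fine on input, too costly to do
# on the fly inside the regex engine.
my $repaired = decode('UTF-8', "\xe2\x90");   # the truncated sequence
printf "U+%04X\n", ord $repaired;             # U+FFFD, the replacement character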
Here's the code in pp_hot.c that dies:
if (RX_OFFS(rx)[i].end < 0 || RX_OFFS(rx)[i].start < 0 ||
    len < 0 || len > strend - s)
    DIE(aTHX_ "panic: pp_match start/end pointers");
That's a post-operation sanity check to ensure that the match points are actually within bounds. It's hard to know exactly how they got out of place, but my guess is that the match position was incremented by a UTF8_SKIP based on a UTF-8 header byte which happened to be followed by an inadequate number of trailing bytes. You're not going to prevent that without checking every UTF8_SKIP addition, which will slow things waaaay down.
Here's the short test case I was looking for, one that doesn't do anything too evil:
$ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in = join q[], <>; Dump($in); () = $in =~ /(.*)/;'
SV = PV(0x9672810) at 0x9694668
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
  CUR = 2
  LEN = 4
panic: pp_match start/end pointers at -e line 1, <> line 1.
The problem with saying that you shouldn't turn on the UTF-8 flag on invalid data is that people are doing that all the time now on data that may very well be garbage. A lot of modules / programs are using the -C switch or the equivalent $PERL_UNICODE or open.pm options. As it is, you can't parse /dev/urandom or IRC log files that may be 99.9% valid UTF-8 without the remaining 0.1% eventually killing perl if you use these widely used and recommended modules.
It's not safe to use -CI with data that is not known to be valid UTF-8. From the docs for binmode():
To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just marks the data as UTF-8 without further checking, while :encoding(utf8) checks the data for actually being valid UTF-8.
I don't think it's acceptable that perl die this way.
I don't think what you want can be achieved without slowing Perl's Unicode string manipulation operations to a crawl, and then only after herculean efforts.
Instead, validate on input.
Marvin Humphrey
From @ikegami
On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
It's not safe to use -CI with data that is not known to be valid UTF-8. From the docs for binmode():
To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just marks the data as UTF-8 without further checking, while :encoding(utf8) checks the data for actually being valid UTF-8.
That sounds like a bug. How much of a penalty is it to do a basic check on an IO stream?
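One way to put a number on the penalty (a rough sketch; the test file path is an assumption, and it should contain known-valid UTF-8):
use Benchmark qw(cmpthese);
my $file = 'big-utf8.txt';
cmpthese(-3, {
    unchecked => sub {    # :utf8 only flags the data
        open my $fh, '<:utf8', $file or die $!;
        1 while <$fh>;
    },
    validated => sub {    # :encoding(UTF-8) decodes and checks
        open my $fh, '<:encoding(UTF-8)', $file or die $!;
        1 while <$fh>;
    },
});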
From @ikegami
On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:
On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
It's not safe to use -CI with data that is not known to be valid UTF-8. From the docs for binmode():
To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just marks the data as UTF-8 without further checking, while :encoding(utf8) checks the data for actually being valid UTF-8.
That sounds like a bug. How much of a penalty is it to do a basic check on an IO stream?
Note that IO stream checks could take advantage of algorithms that work with UTF-8 (32-bit encoded form max) that we can't use internally for utf8.
From [email protected]
Quoth ikegami@adaelis.com (Eric Brine):
On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:
On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
It's not safe to use -CI with data that is not known to be valid UTF-8. From the docs for binmode():
To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just marks the data as UTF-8 without further checking, while :encoding(utf8) checks the data for actually being valid UTF-8.
That sounds like a bug. How much of a penalty is it to do a basic check on an IO stream?
Note that IO stream checks could take advantage of algorithms that work with UTF-8 (32-bit encoded form max) that we can't use internally for utf8.
Not if you're expecting to be able to read the output of perl -CO, and that seems a fairly obvious assumption.
Ben
From @avar
On Wed, Feb 24, 2010 at 22:57, Ben Morrow <ben@morrow.me.uk> wrote:
Quoth ikegami@adaelis.com (Eric Brine):
On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:
On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
It's not safe to use -CI with data that is not known to be valid UTF-8. From the docs for binmode():
To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just marks the data as UTF-8 without further checking, while :encoding(utf8) checks the data for actually being valid UTF-8.
That sounds like a bug. How much of a penalty is it to do a basic check on an IO stream?
Note that IO stream checks could take advantage of algorithms that work with UTF-8 (32-bit encoded form max) that we can't use internally for utf8.
Not if you're expecting to be able to read the output of perl -CO you can't, and that seems a fairly obvious assumption.
Internally processing the stream as UTF-32 doesn't mean you'd output as UTF-32.
From [email protected]
At 11PM +0000 on 24/02/10 Ævar Arnfjörð Bjarmason wrote:
On Wed, Feb 24, 2010 at 22:57, Ben Morrow <ben@morrow.me.uk> wrote:
Quoth ikegami@adaelis.com (Eric Brine):
On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:
That sounds like a bug. How much of a penalty is it to do a basic check on an IO stream?
Note that IO stream checks could take advantage of algorithms that work with UTF-8 (32-bit encoded form max) that we can't use internally for utf8.
Not if you're expecting to be able to read the output of perl -CO you can't, and that seems a fairly obvious assumption.
Internally processing the stream as UTF-32 doesn't mean you'd output as UTF-32.
I wasn't clear. Since perl can represent codepoints greater than 2**32 internally, perl -CO (or any other use of :utf8 on an output handle) can produce such technically-invalid over-long utf8 sequences. If :utf8 on input were to assume that all characters fit into 32 bits, this would mean that (sometimes) you wouldn't be able to feed perl's output back to perl without it complaining. This seems unreasonable.
Ben
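Concretely, the round-trip Ben describes would look something like this (a sketch; the code point is arbitrary, and the second perl accepts the extended sequence only because the :utf8 input layer does no checking):
$ perl -CO -e 'print chr 2**34' | perl -CI -ne 'printf "U+%X\n", ord'
U+400000000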
From @ikegami
On Wed, Feb 24, 2010 at 6:21 PM, Ben Morrow <ben@morrow.me.uk> wrote:
If :utf8 on input were to assume that all characters fit into 32 bits,
First, I didn't mention :utf8. :utf8 is an interface between a PerlIO layer and other Perl code, and I was talking about an external interface. Maybe that's the same thing, maybe not.
this would mean that (sometimes) you wouldn't be able to feed perl's output back to perl without it complaining. This seems unreasonable.
The purpose of IO is to interact, and outputting illegal UTF-8 doesn't accomplish that. Outputting utf8 instead of UTF-8 should be a conscious decision, not the default.
In fact, we already have warnings for some illegal characters, so we're not currently outputting utf8:
$ perl -C -we'print chr 0xFFFF' | od -c
Unicode character 0xffff is illegal at -e line 1.
0000000 357 277 277
0000003
But we're not outputting UTF-8 either:
$ perl -C -we'print chr 0x200_000' | od -c
0000000 370 210 200 200 200
0000005
From @khwilliamson
Eric Brine wrote:
On Wed, Feb 24, 2010 at 6:21 PM, Ben Morrow <ben@morrow.me.uk> wrote:
If :utf8 on input were to assume that all characters fit into 32 bits,
First, I didn't mention :utf8. :utf8 is an interface between a PerlIO layer and other Perl code, and I was talking about an external interface. Maybe that's the same thing, maybe not.
this would mean that (sometimes) you wouldn't be able to feed perl's output back to perl without it complaining. This seems unreasonable.
The purpose of IO is to interact, and outputting illegal UTF-8 doesn't accomplish that. Outputting utf8 instead of UTF-8 should be a conscious decision, not the default.
In fact, we already have warnings for some illegal characters, so we're not currently outputting utf8:
$ perl -C -we'print chr 0xFFFF' | od -c
Unicode character 0xffff is illegal at -e line 1.
0000000 357 277 277
0000003
I feel the need to point out that this character and similar ones are no longer considered illegal by Unicode, but are "noncharacters." To that end, 5.11.x's message now says:
./perl -Ilib -C -we'print chr 0xFFFF' | od -x
Unicode non-character 0xffff is illegal for interchange at -e line 1.
0000000 bfef 00bf
0000003
A set of cooperating applications can accept these characters in I/O, but they should not be sent to an unsuspecting application. And the od gives the legal utf8 for that code point (I checked).
But we're not outputting UTF-8 either:
$ perl -C -we'print chr 0x200_000' | od -c
0000000 370 210 200 200 200
0000005
When I do an od -x instead, I get:
0000000 88f8 8080 0080
0000005
I think this is the correct utf8-like byte sequence for this code point. (I can never remember which term is correct for the utf8 that stops at 0x10FFFF, as the standard requires, and which for the version that extends the same algorithm to code points Unicode says it will never reach.)
From [email protected]
Quoth public@khwilliamson.com (karl williamson):
I think this is the correct utf8-like byte sequence for this code point. (I can never remember which term is correct for the utf8 that stops at 0x10FFFF, as the standard requires, and which for the version that extends the same algorithm to code points Unicode says it will never reach.)
In perl Encode terms, 'UTF-8' means the encoding as defined by the standard (limited to 0x10FFFF, surrogates are illegal, &c.) whereas 'utf8' means perl's extended version of that. I don't know if there are terms in more general use for these concepts.
Ben
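The distinction shows up directly in Encode (a sketch; the byte string is the five-byte extended sequence for U+200000 from the od output above):
use Encode qw(decode);
my $bytes = "\xf8\x88\x80\x80\x80";    # perl-extended encoding of U+200000
my $lax    = decode('utf8',  $bytes);  # lax: accepted as chr 0x200000
my $strict = decode('UTF-8', $bytes);  # strict: substituted with U+FFFD
printf "utf8: U+%X, UTF-8: U+%X\n", ord $lax, ord $strict;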
This should be automatically fixed when https://github.com/Perl/perl5/pull/19121 is finally merged
For reference, this is the reproducer:
#!/usr/bin/env perl
use Encode;
use Devel::Peek;
my $line = "\xe2\x90\x0a";
chomp(my $str = "\xe2\x90\x0a");
Encode::_utf8_on($line);
Encode::_utf8_on($str);
for ($line, $str) {
    Dump($_);
    # Doesn't crash
    $_ =~ /(.*)/;
    # List context
    () = $_ =~ /(.*)/;
}
It is still failing with 5.40.
https://github.com/Perl/perl5/pull/19121 is closed, so this is still an issue.
On Wed, Jul 17, 2024 at 02:02:10PM -0700, mauke wrote:
For reference, this is the reproducer:
#!/usr/bin/env perl
use Encode;
use Devel::Peek;
my $line = "\xe2\x90\x0a";
chomp(my $str = "\xe2\x90\x0a");
Encode::_utf8_on($line);
Encode::_utf8_on($str);
for ($line, $str) {
    Dump($_);
    # Doesn't crash
    $_ =~ /(.*)/;
    # List context
    () = $_ =~ /(.*)/;
}
It is still failing with 5.40.
Just randomly turning the utf8 flag on for an SV containing a bunch of 0x80+ bytes generates a malformed utf8 string. I don't think it's reasonable to expect the regex engine to have to be able to process malformed utf8.
--
The warp engines start playing up a bit, but seem to sort themselves out after a while without any intervention from boy genius Wesley Crusher.
-- Things That Never Happen in "Star Trek" #17
Just randomly turning the utf8 flag on for an SV containing a bunch of 0x80+ bytes generates a malformed utf8 string. I don't think it's reasonable to expect the regex engine to have to be able to process malformed utf8. …
Is that documented somewhere? If not, should it be?
On Thu, Jul 18, 2024 at 03:39:35AM -0700, James E Keenan wrote:
Just randomly turning the utf8 flag on for an SV containing a bunch of 0x80+ bytes generates a malformed utf8 string. I don't think it's reasonable to expect the regex engine to have to be able to process malformed utf8. …
Is that documented somewhere? If not, should it be?
Encode::_utf8_on()'s POD says:
[INTERNAL] Turns the I<STRING>'s internal UTF8 flag B<on>.
I don't know whether it's formally stated anywhere that the regex engine (and any other part of the perl internals for that matter) isn't expected to cope with malformed utf8.
-- Nothing ventured, nothing lost.
I don't know whether it's formally stated anywhere that the regex engine (and any other part of the perl internals for that matter) isn't expected to cope with malformed utf8.
True, but we certainly mention or test for malformed UTF-8 in many files.
$ ack -il 'malformed UTF-?8' . |sort
cpan/Encode/Encode/encode.h
cpan/Encode/lib/Encode/CN/HZ.pm
cpan/JSON-PP/lib/JSON/PP.pm
cpan/JSON-PP/t/001_utf8.t
cpan/JSON-PP/t/112_upgrade.t
cpan/JSON-PP/t/119_incr_parse_utf8.t
dist/Data-Dumper/Changes
dist/Data-Dumper/t/bugs.t
dist/Devel-PPPort/parts/inc/misc
dist/Devel-PPPort/parts/inc/utf8
ext/XS-APItest/APItest.xs
ext/XS-APItest/t/handy_base.pl
ext/XS-APItest/t/hash.t
ext/XS-APItest/t/utf8_to_bytes.t
ext/XS-APItest/t/utf8_warn_base.pl
handy.h
lib/locale.t
lib/Unicode/UCD.pm
lib/utf8.t
mathoms.c
pod/perl5120delta.pod
pod/perl5122delta.pod
pod/perl5160delta.pod
pod/perl5180delta.pod
pod/perl5181delta.pod
pod/perl5200delta.pod
pod/perl5201delta.pod
pod/perl5260delta.pod
pod/perl5261delta.pod
pod/perl5300delta.pod
pod/perl5320delta.pod
pod/perl58delta.pod
pod/perldeprecation.pod
pod/perldiag.pod
pod/perlunicode.pod
pod/perluniintro.pod
Porting/deparse-skips.txt
pp_pack.c
regcomp.c
regexec.c
t/io/utf8.t
t/lib/croak/toke_l1
t/lib/warnings/sv
t/lib/warnings/toke
t/lib/warnings/utf8
t/loc_tools.pl
toke.c
t/op/lex.t
t/op/pack.t
t/op/pos.t
t/porting/diag.t
t/re/pat_advanced.t
t/re/pat_rt_report.t
t/re/pat.t
t/re/reg_mesg.t
t/re/re_tests
t/run/locale.t
t/uni/parser.t
utf8.c
utf8.h
So perhaps we should be thinking about what degree of formalization of this policy would be useful and feasible.
There is a kind of chicken and egg problem here.
One /could/ argue we should validate that a string contains valid utf8 when using _utf8_on(): validation is somewhat costly, and having every utf8-handling part of the internals validate would be repetitive and wasteful, so the data should be validated once, before it gets into the internals.
But we have /other/ APIs for doing this. _utf8_on() is specifically intended for cases where one knows the string contains valid utf8 and one doesn't want to pay the price of validating it; e.g. an external library may be documented to return utf8 but knows nothing about our flag.
So basically, if one uses _utf8_on() as opposed to a validating API, one must be sure the data is valid. Using this API is specifically intended to allow unvalidated data into the internals without performance penalties. Having the internals hardened against something that shouldn't happen isn't an appropriate solution, as it makes other well-behaved actors pay for the bad decisions of the few.
What does make some sense IMO would be to have some mode, perhaps an env var, which causes _utf8_on() to validate its arguments. Then one could enable the special mode to detect improper use without having to pay the penalty for every operation.
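A user-land sketch of that idea (the wrapper is hypothetical; the env var name follows the DEBUG_UTF8_ON suggestion later in the thread, and the environment is checked once at load time, which sidesteps the per-call lookup cost raised below):
package Debug::Utf8On;
use strict;
use warnings;
use Encode ();
# Hypothetical debug switch, read once when the module loads.
my $VALIDATE = $ENV{DEBUG_UTF8_ON};
sub utf8_on {
    if ($VALIDATE) {
        # Validate a copy so FB_CROAK's in-place munging can't touch
        # the caller's scalar; dies loudly on malformed input.
        my $copy = $_[0];
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK);
    }
    Encode::_utf8_on($_[0]);    # the normal unchecked fast path
}
1;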
On 7/20/24 08:55, Yves Orton wrote:
There is a kind of chicken and egg problem here.
One /could/ argue we should validate that a string contains valid utf8 when using _utf8_on(): validation is somewhat costly, and having every utf8-handling part of the internals validate would be repetitive and wasteful, so the data should be validated once, before it gets into the internals.
But we have /other/ APIs for doing this. _utf8_on() is specifically intended for cases where one knows the string contains valid utf8 and one doesn't want to pay the price of validating it; e.g. an external library may be documented to return utf8 but knows nothing about our flag.
So basically, if one uses _utf8_on() as opposed to a validating API, one must be sure the data is valid. Using this API is specifically intended to allow unvalidated data into the internals without performance penalties. Having the internals hardened against something that shouldn't happen isn't an appropriate solution, as it makes other well-behaved actors pay for the bad decisions of the few.
I completely agree with the above.
What does make some sense IMO would be to have some mode, perhaps an env var, which causes _utf8_on() to validate its arguments. Then one could enable the special mode to detect improper use without having to pay the penalty for every operation.
I don't understand this. Why would one want sometimes to check and sometimes not? If you want to check, use an operation that does; otherwise use this one. And checking an environment variable's value is expensive, unless done at startup or at some other rare boundary condition.
I don't understand this. Why would one want sometimes to check and sometimes not? If you want to check, use an operation that does; otherwise use this one.
Same reason it can be convenient to run with or without a debugger. I can imagine DEBUG_UTF8_ON being a useful way to debug improper use of the function. Perhaps a library does not always return valid utf8, something like that.
And checking for an environment variable's value is expensive, unless done at startup or at some other rare boundary condition.
I was thinking it would be a startup behaviour for debugging panic messages from the core.