p5-encode
Detect invalid UTF-8 data at end of file when using PerlIO :encoding(utf-8)
PerlIO layer :encoding(utf-8) seems to fail to report malformed data at the end of a file.
Suppose a file $fn contains valid UTF-8 except for the final character, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but strangely that does not seem to be possible.
For example:
use feature qw(say);
use strict;
use warnings;
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}"; # 2 bytes in iso 8859-1: aå
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
Now $fn contains invalid UTF-8 (the last byte). If I try to read the file using the PerlIO layer :encoding(utf-8):
my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'";
the output is
Read string: 'a'
Note that there is no warning "\xE5" does not map to Unicode in this case.
However, if I read the file as bytes and then use Encode::decode() on the raw data, the warning is printed:
use Encode qw(decode);
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str2 = decode( 'utf-8', $raw_data, Encode::FB_WARN | Encode::LEAVE_SRC );
# warning is printed to STDERR
Why cannot the same thing be achieved with PerlIO::encoding? Is it a bug?
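Until the layer reports this, the raw-read-then-decode approach above can be wrapped in a small helper. This is only a sketch: slurp_utf8_strict is a hypothetical name (it is not part of Encode or PerlIO), and it assumes you would rather die than warn, so it uses Encode::FB_CROAK instead of Encode::FB_WARN.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not provided by Encode or PerlIO): read a file as
# raw bytes and decode it strictly, so malformed UTF-8 anywhere in the
# data, including the last byte, raises an exception.
sub slurp_utf8_strict {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "Could not open file '$path': $!";
    my $raw = do { local $/; <$fh> };    # slurp the whole file as bytes
    close $fh;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $raw unmodified in case the caller wants to inspect it later.
    return decode( 'UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC );
}
```

Calling it inside eval { ... } lets you turn the exception back into a warning if that is what you want, at the cost of holding the whole file in memory.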
See https://metacpan.org/pod/PerlIO::encoding. There is a variable $PerlIO::encoding::fallback, and by default the WARN_ON_ERR bit is set.
So yes, it is a bug, since you did not get a warning.
@pali Yes, when I add the following to the code above (before starting to read the file):
use PerlIO::encoding;
printf "Current value of \$PerlIO::encoding::fallback is '0x%X'\n", $PerlIO::encoding::fallback;
The output is
Current value of $PerlIO::encoding::fallback is '0x902'
which shows that the bitmask constants WARN_ON_ERR and PERLQQ are set by default. There is also an undefined/undocumented bit 0x800 that is set by default: (0x902 & 0x800) == 0x800.
Interestingly, if I try to change the value to a code ref before reading:
$PerlIO::encoding::fallback = sub{ sprintf "<U+%04X>", shift };
The code hangs at readline (i.e. <$fh>). Is this another bug?
Look at the PerlIO::encoding source code. By default these bits are set:
our $fallback =
Encode::PERLQQ()|Encode::WARN_ON_ERR()|Encode::STOP_AT_PARTIAL();
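To see which of these flags a given fallback value contains, the mask can be tested against the Encode constants one by one. A minimal sketch, assuming only the constants named below matter; fallback_flags is a hypothetical helper, not something Encode or PerlIO::encoding provides:

```perl
use strict;
use warnings;
use Encode ();

# Name/value pairs for the Encode CHECK flags to test, in a fixed order.
my @flags = (
    [ DIE_ON_ERR      => Encode::DIE_ON_ERR() ],
    [ WARN_ON_ERR     => Encode::WARN_ON_ERR() ],
    [ RETURN_ON_ERR   => Encode::RETURN_ON_ERR() ],
    [ LEAVE_SRC       => Encode::LEAVE_SRC() ],
    [ PERLQQ          => Encode::PERLQQ() ],
    [ HTMLCREF        => Encode::HTMLCREF() ],
    [ XMLCREF         => Encode::XMLCREF() ],
    [ STOP_AT_PARTIAL => Encode::STOP_AT_PARTIAL() ],
);

# Hypothetical helper: return the names of the flags set in a numeric mask.
sub fallback_flags {
    my ($mask) = @_;
    return map { $_->[0] } grep { ( $mask & $_->[1] ) == $_->[1] } @flags;
}

# Decompose the default value reported earlier in this thread.
printf "0x%X = %s\n", 0x902, join( '|', fallback_flags(0x902) );
```

This only handles numeric masks, so it says nothing about the coderef case.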
A coderef check is supported only by some XS Encode modules, and probably not by PerlIO::encoding.
It looks like this is not an Encode bug but a PerlIO::encoding one, and PerlIO is part of Perl itself. Please report this bug directly to Perl.
I used this test script:
use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';
my $bytes = "\x{61}\x{E5}";
my $fh;
my $buf;
open $fh, '>:raw', \$buf;
print $fh $bytes;
close $fh;
open $fh, "<:encoding(UTF-8)", \$buf;
my $str = do { local $/; <$fh> };
close $fh;
print "$str\n";
open $fh, "<:raw", \$buf;
my $raw = do { local $/; <$fh> };
close $fh;
my $str2 = decode('UTF-8', $raw, Encode::FB_WARN | Encode::LEAVE_SRC);
print "$str2\n";
It turns out this is partly an Encode issue too.
PerlIO::encoding "renew"s the encoding object to ensure it has its own encoding object (per Encode::Encoding), but Encode::decode_xs() treats such a renewed object as always stop_at_partial, which means that PerlIO::encoding can't use that encoding object to process that little bit of excess data at eof.
So I'm stuck trying to fix this on the PerlIO::encoding side.
Unfortunately, simply removing that renewed -> stop_at_partial will break PerlIO::encoding on validly encoded files on older perls, so I don't see a simple fix.
The bug is in PerlIO::scalar and was fixed in perl 5.25.8 by this commit: https://perl5.git.perl.org/perl.git/commit/c47992b404786dcb8752239045e21cbcd7e3d103
There's an issue in PerlIO::encoding and the way it interacts with Encode too:
$ ./perl -e 'print "\xef\xbe"' >shortuni.txt
$ hd shortuni.txt
00000000  ef be                                             |..|
00000002
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
(no output)
but it should be outputting a warning and \x{00EF}, like the following does:
$ ./perl -e 'print "\xef\xbeA"' >shortuni.txt
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
utf8 "\xEF" does not map to Unicode at -e line 1.
\x{00EF}A
This is blead at v5.25.9-35-g32207c6 which includes the (irrelevant) PerlIO::scalar fix.