content_is() dies when comparing content with non-ASCII chars in UTF-8
I am testing a web app that responds with a UTF-8 page. I tried to reduce the problem to a minimal example:
use strict;
use warnings;
use utf8;
use Test2::V0;
use Test::WWW::Mechanize::PSGI;
my $t = Test::WWW::Mechanize::PSGI->new(
    app => sub {
        return [
            200,
            [ 'Content-Type' => 'text/html; charset=utf-8' ],
            ["B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer\n"]
        ];
    },
);
$t->get_ok('/');
warn $t->response->as_string;
$t->content_is("B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer");
Instead of passing the test it dies like this:
$ perl -MCarp::Always mech_utf8_content.pl
# Seeded srand with seed '20211227' from local date.
ok 1 - GET /
200 OK
Content-Type: text/html; charset=utf-8
B�hmer
at mech_utf8_content.pl line 19.
not ok 2 - Content is "Böhmer"
# Failed test 'Content is "Böhmer"'
# at mech_utf8_content.pl line 20.
Use of strings with code points over 0xFF as arguments to bitwise xor (^) operator is not allowed at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/LongString.pm line 65.
Test::LongString::_common_prefix_length("B\x{fffd}hmer\x{a}", "B\x{f6}hmer") called at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/LongString.pm line 201
Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer", "Content is \"B\x{f6}hmer\"") called at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/WWW/Mechanize.pm line 900
Test::WWW::Mechanize::content_is(Test::WWW::Mechanize::PSGI=HASH(0x560eb4901708), "B\x{f6}hmer") called at mech_utf8_content.pl line 20
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
Looks like it's Test::LongString that doesn't like it. What version of Test::LongString do you have?
$ cpanm Test::LongString
Test::LongString is up to date. (0.17)
Yeah, Test::LongString could be smarter about handling that input, but why is WWW-Mechanize or Test-WWW-Mechanize comparing garbage in the first place? I couldn’t get it to read the UTF-8 correctly :pensive:
Why do you say it's comparing garbage? Tell me what you're seeing, explicitly.
You see the strings in the stacktrace? The Böhmer from PSGI is different from that from the test method argument:
Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer", "Content is \"B\x{f6}hmer\"")
Maybe Test::LongString would even fail to compare \x{f6} but \x{fffd} is the Unicode replacement character �.
PS: I must confess I’m pretty confused about what the actual issue is. I’m not 100% sure whether it’s this distribution’s fault at all, or whether the problem lies in reading the content or in comparing it.
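For what it's worth, here is a minimal sketch of one way a U+FFFD could end up in the content. It rests on an assumption this thread has not confirmed: that the PSGI body reaches the client as the single byte 0xF6 (the character string taken as Latin-1) rather than as UTF-8 encoded bytes, and that the client then decodes it as UTF-8 because of the charset=utf-8 header. It uses only the core Encode module; the helper and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper, for display only: list the code points of a string.
sub code_points {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

# Assumption: this is what ends up on the wire when the unencoded
# character string is used as the response body: one byte, 0xF6.
my $wire_unencoded = "B\xF6hmer";

# The same text explicitly encoded to UTF-8 bytes (0xC3 0xB6 for the umlaut).
my $wire_encoded = encode_utf8("B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer");

# Decode both as UTF-8, as a client honouring charset=utf-8 would.
my $from_unencoded = decode('UTF-8', $wire_unencoded);
my $from_encoded   = decode('UTF-8', $wire_encoded);

print code_points($from_unencoded), "\n";
print code_points($from_encoded),   "\n";
```

If the assumption holds, the first line shows U+FFFD in second position and the second shows U+00F6, which would match the two strings in the stack trace. In that case, encoding the body inside the app (for example with encode_utf8) would be the thing to try.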
I see this line in the stacktrace:
Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer"
So to clarify: What you're saying is that "B\x{f6}hmer" is the correct string, but for some reason we are trying to compare it with "B\x{fffd}hmer\x{a}", which is incorrect. The \x{a} at the end would seem to be a line feed, probably added by the ::PSGI test module, no?
You could try loading the page from a file on disk and not go through Test::WWW::Mechanize::PSGI at all.
Hmm, I thought having an external file which could be read in any encoding adds another variable …
I tried and at first it seemed to work:
use strict;
use warnings;
use utf8;
use Test2::V0;
use Test::WWW::Mechanize::PSGI;
use URI::file;
my $t = Test::WWW::Mechanize->new();
my $url = URI::file->new_abs('boehmer.txt')->as_string;
$t->get_ok($url);
warn $t->response->as_string;
$t->content_is("B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer\n");
$ echo "Böhmer" > boehmer.txt && perl -MCarp::Always mech_utf8_content.pl
# Seeded srand with seed '20211227' from local date.
ok 1 - GET file:///home/daniel/workspace/coocook/boehmer.txt
200 OK
Content-Length: 8
Content-Type: text/plain
Last-Modified: Mon, 27 Dec 2021 22:44:26 GMT
Client-Date: Mon, 27 Dec 2021 22:50:15 GMT
Böhmer
at mech_utf8_content.pl line 13.
ok 2 - Content is "Böhmer
# "
# Tests were run but no plan was declared and done_testing() was not seen.
Not sure what that means :thinking: Is the encoding now correct everywhere?
I added a
warn uc $t->response->decoded_content;
and it outputs (with the Unicode replacement character):
B�HMER
at mech_utf8_content.pl line 14.
So Böhmer is not read correctly again.
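A hedged aside: the � in the warn output does not necessarily mean the content was decoded wrongly. As far as the script shows, STDERR has no encoding layer, so a correctly decoded Ö (U+00D6) would be written as the single byte 0xD6, which a UTF-8 terminal renders as �. The sketch below reproduces that effect with an in-memory filehandle standing in for STDERR; the helper name is hypothetical, and the literal string stands in for the decoded response content.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Hypothetical helper: write $text to an in-memory handle opened with
# the given mode and return the raw bytes that were written, as hex.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

# Stands in for: uc $t->response->decoded_content
my $decoded = uc "B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer";

# No encoding layer: the umlaut goes out as one Latin-1 byte.
print bytes_written('>', $decoded), "\n";

# With a UTF-8 layer: the umlaut goes out as two bytes.
print bytes_written('>:encoding(UTF-8)', $decoded), "\n";
```

If that is what is happening here, calling binmode(STDERR, ':encoding(UTF-8)') before the warn should make the output readable. That is something to try, not a confirmed fix.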
Is the encoding now correct everywhere?
I know very little about encodings so I'm relying on you to tell me.