test-www-mechanize content_is() dies when comparing content with non-ASCII chars in UTF-8

I am testing a web app that responds with a UTF-8 page. I tried to reduce the problem to a minimal example:

use strict;
use warnings;
use utf8;

use Test2::V0;
use Test::WWW::Mechanize::PSGI;

my $t = Test::WWW::Mechanize::PSGI->new(
    app => sub {
        return [
            200,
            [ 'Content-Type' => 'text/html; charset=utf-8' ],
            ["B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer\n"]
        ];
    },
);

$t->get_ok('/');
warn $t->response->as_string;
$t->content_is("B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer");

Instead of passing the test it dies like this:

$ perl -MCarp::Always mech_utf8_content.pl
# Seeded srand with seed '20211227' from local date.
ok 1 - GET /
200 OK
Content-Type: text/html; charset=utf-8

B�hmer
 at mech_utf8_content.pl line 19.
not ok 2 - Content is "Böhmer"
#   Failed test 'Content is "Böhmer"'
#   at mech_utf8_content.pl line 20.
Use of strings with code points over 0xFF as arguments to bitwise xor (^) operator is not allowed at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/LongString.pm line 65.
        Test::LongString::_common_prefix_length("B\x{fffd}hmer\x{a}", "B\x{f6}hmer") called at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/LongString.pm line 201
        Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer", "Content is \"B\x{f6}hmer\"") called at /home/daniel/perl5/perlbrew/perls/perl-5.34.0-threads-buster/lib/site_perl/5.34.0/Test/WWW/Mechanize.pm line 900
        Test::WWW::Mechanize::content_is(Test::WWW::Mechanize::PSGI=HASH(0x560eb4901708), "B\x{f6}hmer") called at mech_utf8_content.pl line 20
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.

Dec 27 '21 21:12 dboehmer

Looks like it's Test::LongString that doesn't like it. What version of Test::LongString do you have?

Dec 27 '21 21:12 petdance

$ cpanm Test::LongString
Test::LongString is up to date. (0.17)

Yeah, Test::LongString could be smarter about handling that input, but why is WWW-Mechanize or Test-WWW-Mechanize comparing garbage? I couldn’t make it read the UTF-8 correctly :pensive:

Dec 27 '21 21:12 dboehmer

Why do you say it's comparing garbage? Tell me what you're seeing, explicitly.

Dec 27 '21 21:12 petdance

You see the strings in the stacktrace? The Böhmer from PSGI is different from that from the test method argument:

Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer", "Content is \"B\x{f6}hmer\"")

Maybe Test::LongString would even fail to compare \x{f6} but \x{fffd} is the Unicode replacement character �.
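One way such a difference can arise is sketched below. This is an assumption about the cause, not something confirmed in this thread: if a response body is a Perl character string rather than UTF-8 encoded bytes, the ö may go out as the single byte 0xF6, which a UTF-8 decoder replaces with U+FFFD. The sketch uses only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# The body as a character string: o-with-diaeresis is the code point U+00F6.
my $chars = "B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer\n";

# Assumption: sent without encoding, U+00F6 travels as the single byte 0xF6.
my $raw_bytes = "B\xF6hmer\n";

# A lone 0xF6 is not valid UTF-8, so the decoder substitutes U+FFFD for it.
my $decoded_raw = decode('UTF-8', $raw_bytes);

# Encoded first, the character becomes the bytes 0xC3 0xB6 and round-trips.
my $utf8_bytes = encode_utf8($chars);
my $decoded_ok = decode('UTF-8', $utf8_bytes);

# %vX prints the code point of every character, joined by dots.
printf "unencoded body decodes to: %vX\n", $decoded_raw;
printf "encoded body decodes to:   %vX\n", $decoded_ok;
print  "round trip intact: ", ($decoded_ok eq $chars ? "yes" : "no"), "\n";
```

If this is what happens in the minimal example, passing the body through encode_utf8() inside the PSGI app would be the first thing to try, since the response declares charset=utf-8.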

PS: I must confess I’m pretty confused about what the actual issue is. I'm not 100% sure whether it’s this distro’s fault at all, or whether the problem lies in reading the content or in comparing it.

Dec 27 '21 21:12 dboehmer

I see this line in the stacktrace:

Test::LongString::is_string("B\x{fffd}hmer\x{a}", "B\x{f6}hmer"

So to clarify: What you're saying is that "B\x{f6}hmer" is the correct string, but for some reason we are trying to compare it with "B\x{fffd}hmer\x{a}", which is incorrect. The \x{a} at the end would seem to be a line feed, probably added by the ::PSGI test module, no?

You could try loading the page from a file on disk and not go through Test::WWW::Mechanize::PSGI at all.

Dec 27 '21 21:12 petdance

Hmm, I thought having an external file which could be read in any encoding adds another variable …

I tried and at first it seemed to work:

use strict;
use warnings;
use utf8;

use Test2::V0;
use Test::WWW::Mechanize::PSGI;
use URI::file;

my $t = Test::WWW::Mechanize->new();

my $url = URI::file->new_abs('boehmer.txt')->as_string;
$t->get_ok($url);
warn $t->response->as_string;
$t->content_is("B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer\n");

$ echo "Böhmer" > boehmer.txt && perl -MCarp::Always mech_utf8_content.pl 
# Seeded srand with seed '20211227' from local date.
ok 1 - GET file:///home/daniel/workspace/coocook/boehmer.txt
200 OK
Content-Length: 8
Content-Type: text/plain
Last-Modified: Mon, 27 Dec 2021 22:44:26 GMT
Client-Date: Mon, 27 Dec 2021 22:50:15 GMT

Böhmer
 at mech_utf8_content.pl line 13.
ok 2 - Content is "Böhmer
# "
# Tests were run but no plan was declared and done_testing() was not seen.

Not sure what that means :thinking: Is the encoding now correct everywhere?

I added a

warn uc $t->response->decoded_content;

and it outputs (with the Unicode replacement character):

B�HMER
 at mech_utf8_content.pl line 14.

So Böhmer is not read correctly again.
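A replacement character on the terminal does not by itself show that the string contains U+FFFD. The sketch below is a hedged check, assuming STDERR has no encoding layer and the terminal expects UTF-8: it inspects the code points of the uppercased string directly instead of relying on what warn displays. Only the core Encode module is used; the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # make uc() follow Unicode rules for every string
use Encode qw(encode_utf8);

# Uppercasing U+00F6 gives U+00D6, still a code point below 0x100.
my $upper = uc "B\N{LATIN SMALL LETTER O WITH DIAERESIS}hmer";

# Look at the string itself rather than at terminal output.
my $has_replacement = $upper =~ /\x{FFFD}/ ? "yes" : "no";
my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $upper;

# What a UTF-8 terminal would need to receive for this string.
my @utf8_bytes = map { sprintf('%02X', ord($_)) } split //, encode_utf8($upper);

print "contains U+FFFD: $has_replacement\n";
print "code points: @code_points\n";
print "UTF-8 bytes: @utf8_bytes\n";
```

Without an encoding layer, a character such as U+00D6 is written as the single byte 0xD6, which a UTF-8 terminal cannot display. Adding binmode STDERR, ':encoding(UTF-8)'; before the warn would show whether the decoded content is in fact correct.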

Dec 27 '21 22:12 dboehmer

"Is the encoding now correct everywhere?"

I know very little about encodings so I'm relying on you to tell me.

Dec 27 '21 22:12 petdance