Dancer2 icon indicating copy to clipboard operation
Dancer2 copied to clipboard

to_json is double encoding UTF8?

Open ambs opened this issue 9 years ago • 28 comments

try:

package foo;
use Dancer2;
use utf8;

get '/' => sub {
    return to_json({foo => "Simões"});
};
1;

vs

package foo;
use Dancer2;
use utf8;

get '/' => sub {
    return to_json({foo => "Simões"}, { utf8 => 0} );
};
1;

The second works :-|

When fixing this, take care not to break the automatic serializer. Also, I think this is a regression.

ambs avatar Sep 11 '14 15:09 ambs

Update: OK, I think a got it. The whole thing is in D2::Core::Response.pm .When not using a 'set serialzer => something' (https://metacpan.org/pod/distribution/Dancer2/lib/Dancer2/Manual.pod#Serializers1) , i.e. when manually serializing the output in a route handler (like you are doing with to_json), the 'is_encoded' flag in the Response object is NOT set, causing it to encode it with UTF-8 in D2::Core::Response->encode_content (line: ~170). That is why setting { utf8 => 0 } is necessary in this case.

Indeed, only setting utf8 to 0 in to_json made this test to pass for me. But I have the same problem with Dancer1, where I too use { utf8 => 0 } everywhere when it comes to serialization.

use utf8;
use strict;
use warnings;

use Test::More;
use Plack::Test;
use HTTP::Request::Common;
use Encode;

use Dancer2::Serializer::JSON;

my $utf8 = 'ľščťžýáí';

{
    package MyApp;

    use Dancer2;
    use utf8;

    get '/serialize'  => sub {
        return to_json({ foo => $utf8 }, { utf8 => 0 });
    };
}

my $app = Dancer2->psgi_app;

test_psgi $app, sub {
    my $cb = shift;

    my $res = $cb->( GET '/serialize' );
    is $res->code, 200;
    is Encode::decode('utf8', $res->content), '{"foo":"'.$utf8.'"}';

    my $json = $res->content;
    my $serializer = Dancer2::Serializer::JSON->new;

    is_deeply $serializer->deserialize($json),
              { foo => "$utf8" };
};

done_testing();

DavsX avatar Sep 12 '14 10:09 DavsX

I think (not proved by looking into the code) that Dancer is encoding the return value of the route, even if it is already encoded. You know, that "utf8" vs "perl internal format".

Need to look at it, when I finish my hundred other tasks :-|

ambs avatar Sep 12 '14 10:09 ambs

See my updated post above (the first half), I think the problem is described there.

#TODO write a Dancer2::Unicode manual?

DavsX avatar Sep 12 '14 10:09 DavsX

No, do not think a manual is what is missing. This should work out of the box.

If to_json is always used by the user (the serializer can use other method by itself), then just switch off the utf8 encoding in that method by default!

ambs avatar Sep 12 '14 10:09 ambs

Well the 'auto-serializer' uses the 'serialize' method directly, so yes, turning utf-8 to 0 by default when using to_/from_* methods would be an option.

But there could be problems, when people use Dancer2's to_json method for something else, than returning a response (serializing stuff into/from the DB etc. - in those cases the content won't be encoded to UTF-8 by default), so that could break compatibility/cause problems in existing codebases.

Calling the to_json method teoretically could set the is_encoded flag in the response somehow, but then again, there is no guarantee, that to_json is used to return a response.

DavsX avatar Sep 12 '14 10:09 DavsX

Yes, good point.

Would it be possible to ask to_json to return a native Perl string, that will then be encoded accordingly with the filehandle that is used?

ambs avatar Sep 12 '14 10:09 ambs

BTW in your case using "set serializer => 'JSON';" solves the problem; so another option is to document this behaviour and state, that to_json should NOT be used to return a response from a route handler, and an auto-serializer should be used in those occasions.

DavsX avatar Sep 12 '14 10:09 DavsX

Not really. Imagine you want to generate javascript code:

     template 'index.tt' => { json_obj => to_json($myobj) };

This is what I intend to do. I just used a simpler version that was failing.

ambs avatar Sep 12 '14 10:09 ambs

Yeah, good point. I have no idea what would be best in this case.

DavsX avatar Sep 12 '14 10:09 DavsX

I confess I am quite ignorant regarding the utf8 handling of Perl. But if I can say

  use utf8;
 ...

  template 'index.tt' => { name => 'Alberto Simões' };

and it works (it doesn't get double encoded), then it should be possible to make to_json return an encoding agnostic string...

ambs avatar Sep 12 '14 10:09 ambs

Yeah. 'use utf8;' only tells Perl, that the source code (the .pm file) is in utf-8. In this case this use statement is important, because you have the 'Alberto Simões' string directly in the source code. I believe (and i may be wrong), that in this stage that string is unicode. Only after getting it through the Response object is it encoded into utf-8 (into actual bytes). Now the problem with to_json now is, that by default it does internally the same, after which comes the Response object and encodes the encoded string one more time.

While your assumption regarding to_json usage is quite reasonable, it can be achieved by telling JSON not-to-encode the data by setting utf8 to 0. By playing with the JSON's utf8 setting we must either manually specifying { utf8 => 1 } whenever we use to_json for not returning a response from a route handler, or we must specify { utf8 => 0 } everytime we use it for returning a response. If i'd have to choose from this two option, i'd say we should not need to specify any utf8 options when we are returning a response.

As far as I know there is no way of telling if a string is utf8 or not. There is a is_utf8 method in Perl, which tells if the UTF-8 flag is turned on in a string, but the documentation states that it should be only used for debugging purposes.

BTW this is still my favourite unicode tutorial: http://taniquetil.com.ar/unicode.png :)

DavsX avatar Sep 12 '14 11:09 DavsX

My problem is that it is counter-intuitive to set { utf8 => 0 } when you know it is not.

ambs avatar Sep 12 '14 11:09 ambs

Yes, indeed, it is. But if we set utf8 to 0 by default, it would be counter-intuitive to set { utf8 => 1 } elsewhere. There is no one good solution, when the default utf8 config option is involved in the solution.

DavsX avatar Sep 12 '14 11:09 DavsX

Probably there should be a wrapper to to_json that call Encode decode method, decoding the utf8 into perl native form. If someone have the time to test, I would be very interested to know how all these situations would play together with that wrapper.

ambs avatar Sep 12 '14 11:09 ambs

It seems that there are a few scenarios where utf8 serialization is a problem:

get '/' => sub { my $x = to_json(params('form_data')); $obj->process($x); return template 'success'; };
get '/' => sub { template 'success', { data => to_json($data) }; };
get '/' => sub { return to_json(params('form_data')); };

In the first 1 case we would expect to_json to encode UTF-8, while in the last two cases it should not encode. EDIT: My bad. In the second case we should NOT encode inside to_json, because the template's output will be later on encoded in the Response object!

I too was thinking about adding maybe a new DSL that should be only used to return a response from a route handler, like 'response_to_json', which would NOT encode to utf.

DavsX avatar Sep 12 '14 11:09 DavsX

I know Encode is evil, but just to help the discussion:

Editing Dancer2::Serializer::JSON and adding the 'Encode' module:

sub to_json {
    my $s = Dancer2::Serializer::JSON->new;
    decode("utf8", $s->serialize(@_));
}

This hack, for the second two examples.

Anybody can test for the first case? (using the serializer directly?)

ambs avatar Sep 12 '14 13:09 ambs

to_json is one of three helpers that are provided. All three should be consistent in what they return (with respect to characters or octets); which got tricky at one point when we were still using YAML::Any as YAML::XS::Dump returned octets but YAML::Dump returned characters. At that point the "lowest common denominator" was to have them return bytes, which is what they currently do. As @ambs pointed out early in this issue, since these helpers return octets, if you return that output as the response from a route the Response object with encode it again.

We no longer use YAML::Any, so a rethink of what these helpers return is reasonable. It's also worth reading over the comments in #647; as the helpers (like to_json) have some other issues..

veryrusty avatar Sep 14 '14 05:09 veryrusty

Hi, I still have problems with the from_json and to_json helpers provided in the Dancer2::Serializer::JSON. Seems that the utf8 option is set to 1 if not specified, which doesn't work for me all the time. I was able to fix it by checking if the string it utf8 already:

if (!defined $options->{utf8}) { $options->{utf8} = utf8::is_utf8($entity) ? 0 : 1 ; }

satmovi avatar Jan 22 '15 14:01 satmovi

@satmovi It is also possible to change the parameters via the config. The default is to enable utf8, but that's documented.

xsawyerx avatar Sep 28 '15 17:09 xsawyerx

The Dancer2 v0.200000 release includes send_as in core. The initial example provided by @ambs can be written (and do the right thing with respect to encoding):

use Dancer2;

get '/' => sub {
    send_as JSON => { foo => "Simões" };
};

1;

@ambs, @DavsX: Would a doc patch resolve this now?

veryrusty avatar Jun 02 '16 10:06 veryrusty

On 02/06/16 11:45, Russell Jenkins wrote:

The Dancer2 v0.200000 release includes |send_as| in core. The initial example provided by @ambs https://github.com/ambs can be written (and do the right thing with respect to encoding):

|use Dancer2; get '/' => sub { send_as JSON => { foo => "Simões" }; }; 1; |

@ambs https://github.com/ambs, @DavsX https://github.com/DavsX: Would a doc patch resolve this now?

Need to test, then I will say something O:-)

ambs avatar Jun 02 '16 10:06 ambs

@ambs Did you get a chance to do some testing ?

veryrusty avatar Jun 18 '16 13:06 veryrusty

sorry, not yet. Be free to close. I'll annoy again when I stump on it.

ambs avatar Jun 18 '16 13:06 ambs

One more ping?

How about encode_json vs. to_json?

xsawyerx avatar Oct 16 '16 12:10 xsawyerx

Think we can get this resolved for a release this week?

:+1: for suggestion by @xsawyerx

cromedome avatar Oct 09 '17 02:10 cromedome

I run into the same problem (hidden by fuzzed encoding issues solved thanks to @tux) and correct flagged and encoded utf8 strings fed to Dancer2 were double encoded.

As a workaround I added

engines:
  serializer:
    JSON:
      utf8: 0

to config.yml - which hides the problem. But the problem is, that already encoded utf-8 strings are encoded again, while:

# our identified pain
my $utf8_string = Encode::encode("utf8", "Hotline costs \N{EURO SIGN} 0,86 / min");
# works fine: {"foo":"Hotline costs € 0,86 / min"}
say JSON->new->encode({foo => $utf8_string});
# writes codepoints of utf8 bytes: {"foo":"Hotline costs \u00e2\u0082\u00ac 0,86 / min"}
say JSON->new->ascii->encode({foo => $utf8_string});
# double encodes: {"foo":"Hotline costs € 0,86 / min"}
say JSON->new->utf8->encode({foo => $utf8_string});

So conclusion: JSON seems to be able to do the right thing (TM) automatically.

A documentation update clarifying how to workaround the current state would be nice, but from my point of view it won't fix the issue itself.

rehsack avatar Oct 11 '17 12:10 rehsack

We already have both to_json and encode_json as DSL keywords (the later added in commit 7c448c9e).

Their implementation is the same; wrapping JSON::MaybeXS::encode_json, for which the three modules it may choose from (Cpanel::JSON::XS, JSON::XS or JSON::PP) all document encode_json as returning a UTF-8 encoded, binary string (that is, the string contains octets only). In my opinion, having the DSL keywords behave the same (with respect to encoding) as the default exported functions from those JSON modules is the correct behaviour.

The double encoding occurs when an octet string is returned from a route. The around modifier on the response object content attribute is going to encode that again in the current implementation. Note that JSON uses utf8::downgrade ( rather than Encode::encode ) which is documented as a no-op if the string was already in native format. That may be worth investigating if the app encoding is utf8; it may also have its own set of issues.

Again, when you don't have a serializer defined, I'd really encourage send_as JSON => $data_structure if you are returning JSON from a route.

veryrusty avatar Oct 11 '17 13:10 veryrusty

We use serializer => "JSON" and always return utf8 encoded strings (octet can also be some kind of opaque snmp representation, what we don't use). And yes, those strings contain utf8 octets. And they're flagged correctly and JSON respects that flag (since our database returns that the right way).

When around modifier [...] shall return correct utf-8, and you do not trust the correctness of the is_utf8 flag, maybe use following snippet from @tux:

# from H.Merijn Brand
my $u = $s; # assume $s contains the string returned from the route
utf8::downgrade ($u);
my $g = $u;
$u = decode ("utf-8", $u);

# Check if it was double encoded
my $b = encode ("utf-8", $u);
my $final = $b eq $g ? $u : $g;

I would prefer that it it possible to configure whether such an correction is attempted - but of course an enabled default an acceptable approach.

Depending on the input data (utf16, ...) an utf8::downgrade could be insane (resulting string contains wide characters), so I used Encode::_utf8_off (mind the warning).

rehsack avatar Oct 11 '17 15:10 rehsack