Dancer2
Dancer2 copied to clipboard
to_json is double encoding UTF8?
try:
package foo;
use Dancer2;
use utf8;
get '/' => sub {
return to_json({foo => "Simões"});
};
1;
vs
package foo;
use Dancer2;
use utf8;
get '/' => sub {
return to_json({foo => "Simões"}, { utf8 => 0} );
};
1;
The second works :-|
When fixing this, take care not to break the automatic serializer. Also, I think this is a regression.
Update: OK, I think a got it. The whole thing is in D2::Core::Response.pm .When not using a 'set serialzer => something' (https://metacpan.org/pod/distribution/Dancer2/lib/Dancer2/Manual.pod#Serializers1) , i.e. when manually serializing the output in a route handler (like you are doing with to_json), the 'is_encoded' flag in the Response object is NOT set, causing it to encode it with UTF-8 in D2::Core::Response->encode_content (line: ~170). That is why setting { utf8 => 0 } is necessary in this case.
Indeed, only setting utf8 to 0 in to_json made this test to pass for me. But I have the same problem with Dancer1, where I too use { utf8 => 0 } everywhere when it comes to serialization.
use utf8;
use strict;
use warnings;
use Test::More;
use Plack::Test;
use HTTP::Request::Common;
use Encode;
use Dancer2::Serializer::JSON;
my $utf8 = 'ľščťžýáí';
{
package MyApp;
use Dancer2;
use utf8;
get '/serialize' => sub {
return to_json({ foo => $utf8 }, { utf8 => 0 });
};
}
my $app = Dancer2->psgi_app;
test_psgi $app, sub {
my $cb = shift;
my $res = $cb->( GET '/serialize' );
is $res->code, 200;
is Encode::decode('utf8', $res->content), '{"foo":"'.$utf8.'"}';
my $json = $res->content;
my $serializer = Dancer2::Serializer::JSON->new;
is_deeply $serializer->deserialize($json),
{ foo => "$utf8" };
};
done_testing();
I think (not proved by looking into the code) that Dancer is encoding the return value of the route, even if it is already encoded. You know, that "utf8" vs "perl internal format".
Need to look at it, when I finish my hundred other tasks :-|
See my updated post above (the first half), I think the problem is described there.
#TODO write a Dancer2::Unicode manual?
No, do not think a manual is what is missing. This should work out of the box.
If to_json is always used by the user (the serializer can use other method by itself), then just switch off the utf8 encoding in that method by default!
Well the 'auto-serializer' uses the 'serialize' method directly, so yes, turning utf-8 to 0 by default when using to_/from_* methods would be an option.
But there could be problems, when people use Dancer2's to_json method for something else, than returning a response (serializing stuff into/from the DB etc. - in those cases the content won't be encoded to UTF-8 by default), so that could break compatibility/cause problems in existing codebases.
Calling the to_json method teoretically could set the is_encoded flag in the response somehow, but then again, there is no guarantee, that to_json is used to return a response.
Yes, good point.
Would it be possible to ask to_json to return a native Perl string, that will then be encoded accordingly with the filehandle that is used?
BTW in your case using "set serializer => 'JSON';" solves the problem; so another option is to document this behaviour and state, that to_json should NOT be used to return a response from a route handler, and an auto-serializer should be used in those occasions.
Not really. Imagine you want to generate javascript code:
template 'index.tt' => { json_obj => to_json($myobj) };
This is what I intend to do. I just used a simpler version that was failing.
Yeah, good point. I have no idea what would be best in this case.
I confess I am quite ignorant regarding the utf8 handling of Perl. But if I can say
use utf8;
...
template 'index.tt' => { name => 'Alberto Simões' };
and it works (it doesn't get double encoded), then it should be possible to make to_json
return an encoding agnostic string...
Yeah. 'use utf8;' only tells Perl, that the source code (the .pm file) is in utf-8. In this case this use statement is important, because you have the 'Alberto Simões' string directly in the source code. I believe (and i may be wrong), that in this stage that string is unicode. Only after getting it through the Response object is it encoded into utf-8 (into actual bytes). Now the problem with to_json now is, that by default it does internally the same, after which comes the Response object and encodes the encoded string one more time.
While your assumption regarding to_json usage is quite reasonable, it can be achieved by telling JSON not-to-encode the data by setting utf8 to 0. By playing with the JSON's utf8 setting we must either manually specifying { utf8 => 1 } whenever we use to_json for not returning a response from a route handler, or we must specify { utf8 => 0 } everytime we use it for returning a response. If i'd have to choose from this two option, i'd say we should not need to specify any utf8 options when we are returning a response.
As far as I know there is no way of telling if a string is utf8 or not. There is a is_utf8 method in Perl, which tells if the UTF-8 flag is turned on in a string, but the documentation states that it should be only used for debugging purposes.
BTW this is still my favourite unicode tutorial: http://taniquetil.com.ar/unicode.png :)
My problem is that it is counter-intuitive to set { utf8 => 0 } when you know it is not.
Yes, indeed, it is. But if we set utf8 to 0 by default, it would be counter-intuitive to set { utf8 => 1 } elsewhere. There is no one good solution, when the default utf8 config option is involved in the solution.
Probably there should be a wrapper to to_json that call Encode decode method, decoding the utf8 into perl native form. If someone have the time to test, I would be very interested to know how all these situations would play together with that wrapper.
It seems that there are a few scenarios where utf8 serialization is a problem:
get '/' => sub { my $x = to_json(params('form_data')); $obj->process($x); return template 'success'; };
get '/' => sub { template 'success', { data => to_json($data) }; };
get '/' => sub { return to_json(params('form_data')); };
In the first 1 case we would expect to_json to encode UTF-8, while in the last two cases it should not encode. EDIT: My bad. In the second case we should NOT encode inside to_json, because the template's output will be later on encoded in the Response object!
I too was thinking about adding maybe a new DSL that should be only used to return a response from a route handler, like 'response_to_json', which would NOT encode to utf.
I know Encode
is evil, but just to help the discussion:
Editing Dancer2::Serializer::JSON and adding the 'Encode' module:
sub to_json {
my $s = Dancer2::Serializer::JSON->new;
decode("utf8", $s->serialize(@_));
}
This hack, for the second two examples.
Anybody can test for the first case? (using the serializer directly?)
to_json
is one of three helpers that are provided. All three should be consistent in what they return (with respect to characters or octets); which got tricky at one point when we were still using YAML::Any as YAML::XS::Dump returned octets but YAML::Dump returned characters. At that point the "lowest common denominator" was to have them return bytes, which is what they currently do. As @ambs pointed out early in this issue, since these helpers return octets, if you return that output as the response from a route the Response object with encode it again.
We no longer use YAML::Any, so a rethink of what these helpers return is reasonable. It's also worth reading over the comments in #647; as the helpers (like to_json
) have some other issues..
Hi, I still have problems with the from_json and to_json helpers provided in the Dancer2::Serializer::JSON. Seems that the utf8 option is set to 1 if not specified, which doesn't work for me all the time. I was able to fix it by checking if the string it utf8 already:
if (!defined $options->{utf8}) { $options->{utf8} = utf8::is_utf8($entity) ? 0 : 1 ; }
@satmovi It is also possible to change the parameters via the config. The default is to enable utf8, but that's documented.
The Dancer2 v0.200000 release includes send_as
in core. The initial example provided by @ambs can be written (and do the right thing with respect to encoding):
use Dancer2;
get '/' => sub {
send_as JSON => { foo => "Simões" };
};
1;
@ambs, @DavsX: Would a doc patch resolve this now?
On 02/06/16 11:45, Russell Jenkins wrote:
The Dancer2 v0.200000 release includes |send_as| in core. The initial example provided by @ambs https://github.com/ambs can be written (and do the right thing with respect to encoding):
|use Dancer2; get '/' => sub { send_as JSON => { foo => "Simões" }; }; 1; |
@ambs https://github.com/ambs, @DavsX https://github.com/DavsX: Would a doc patch resolve this now?
Need to test, then I will say something O:-)
@ambs Did you get a chance to do some testing ?
sorry, not yet. Be free to close. I'll annoy again when I stump on it.
One more ping?
How about encode_json
vs. to_json
?
Think we can get this resolved for a release this week?
:+1: for suggestion by @xsawyerx
I run into the same problem (hidden by fuzzed encoding issues solved thanks to @tux) and correct flagged and encoded utf8 strings fed to Dancer2 were double encoded.
As a workaround I added
engines:
serializer:
JSON:
utf8: 0
to config.yml - which hides the problem. But the problem is, that already encoded utf-8 strings are encoded again, while:
# our identified pain
my $utf8_string = Encode::encode("utf8", "Hotline costs \N{EURO SIGN} 0,86 / min");
# works fine: {"foo":"Hotline costs € 0,86 / min"}
say JSON->new->encode({foo => $utf8_string});
# writes codepoints of utf8 bytes: {"foo":"Hotline costs \u00e2\u0082\u00ac 0,86 / min"}
say JSON->new->ascii->encode({foo => $utf8_string});
# double encodes: {"foo":"Hotline costs ⬠0,86 / min"}
say JSON->new->utf8->encode({foo => $utf8_string});
So conclusion: JSON seems to be able to do the right thing (TM) automatically.
A documentation update clarifying how to workaround the current state would be nice, but from my point of view it won't fix the issue itself.
We already have both to_json
and encode_json
as DSL keywords (the later added in commit 7c448c9e).
Their implementation is the same; wrapping JSON::MaybeXS::encode_json
, for which the three modules it may choose from (Cpanel::JSON::XS
, JSON::XS
or JSON::PP
) all document encode_json
as returning a UTF-8 encoded, binary string (that is, the string contains octets only)
. In my opinion, having the DSL keywords behave the same (with respect to encoding) as the default exported functions from those JSON modules is the correct behaviour.
The double encoding occurs when an octet string is returned from a route. The around
modifier on the response object content
attribute is going to encode that again in the current implementation. Note that JSON uses utf8::downgrade
( rather than Encode::encode ) which is documented as a no-op if the string was already in native format. That may be worth investigating if the app encoding is utf8; it may also have its own set of issues.
Again, when you don't have a serializer defined, I'd really encourage send_as JSON => $data_structure
if you are returning JSON from a route.
We use serializer => "JSON"
and always return utf8 encoded strings (octet can also be some kind of opaque snmp representation, what we don't use). And yes, those strings contain utf8 octets. And they're flagged correctly and JSON respects that flag (since our database returns that the right way).
When around
modifier [...]
shall return correct utf-8, and you do not trust the correctness of the is_utf8 flag, maybe use following snippet from @tux:
# from H.Merijn Brand
my $u = $s; # assume $s contains the string returned from the route
utf8::downgrade ($u);
my $g = $u;
$u = decode ("utf-8", $u);
# Check if it was double encoded
my $b = encode ("utf-8", $u);
my $final = $b eq $g ? $u : $g;
I would prefer that it it possible to configure whether such an correction is attempted - but of course an enabled default an acceptable approach.
Depending on the input data (utf16, ...) an utf8::downgrade
could be insane (resulting string contains wide characters), so I used Encode::_utf8_off
(mind the warning).