Dancer2
`charset` config option is mostly ignored
The documentation says the following about the charset config option:
It sets the default charset of outgoing content. charset= item will be added to Content-Type response header.
It makes Unicode bodies in HTTP responses of text/* types to be encoded to this charset.
It also indicates to Dancer2 in which charset the static files and templates are encoded.
If you're using Dancer2::Plugin::Database, UTF-8 support will automatically be enabled for your database - see "AUTOMATIC UTF-8 SUPPORT" in Dancer2::Plugin::Database
A quick grep through the code reveals that this is not at all true:
$ grep -PR 'config->.*charset' Dancer2*
Dancer2/Core/Role/Logger.pm: $config->{'charset'} || 'UTF-8',
Dancer2/Core/Role/Logger.pm: $config->{'charset'} || 'UTF-8',
Dancer2/Core/Role/Logger.pm: $config->{'charset'} || 'UTF-8',
Dancer2/Core/App.pm: $content_type .= "; charset=" . ( $self->config->{charset} || "utf-8" );
We can see that $config->{charset} is used only for two purposes:
- Setting the logger encoding (this is not documented)
- Trying to set the charset= in the Content-Type header (but this doesn't actually work)
In most of the places where $config->{charset} should be used, we instead find a hard-coded UTF-8:
Dancer2/Core/Response.pm: $self->content_type("$ct; charset=UTF-8")
Dancer2/Core/Response.pm: return Encode::encode( 'UTF-8', $content );
This is the worst offender. It doesn't even check whether the string is already encoded before calling encode, occasionally resulting in doubly encoded garbage (é comes out as Ã©). Just don't do that.
Dancer2/Core/Role/Template.pm: default => sub {'UTF-8'},
Although it says default, this is actually never overridden, forcing TT to use UTF-8.
Dancer2/Core/Error.pm: default => sub {'UTF-8'},
This is never overridden either, but it's actually fine, since it's only used for the default error page, not for custom ones.
Dancer2/Serializer/YAML.pm: encode('UTF-8', YAML::Dump($entity));
Dancer2/Serializer/YAML.pm: YAML::Load(decode('UTF-8', $content));
This looks like it's going directly against the YAML spec ("On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported"), but I'm not sure what the consequences are.
Overall, this means that Dancer2 is completely unusable unless the internal UTF-8 flag is set on all strings you pass to it. This will usually be the case (since Dancer2 forcibly imports utf8 into its caller), but that's a rather huge restriction, if you ask me.
Actually there is nothing wrong with using just UTF-8 in a modern web application, in my opinion. But you are right that the documentation and the code don't match, and this is clearly a bug.
Thanks for your investigation, it clearly shows that this configuration option is now quite bogus.
I personally have no problem using UTF-8 everywhere (although some people might want to use other encodings). I actually encountered this bug while working in a fully UTF-8 environment (sources, inputs, outputs, locale and terminal all in UTF-8).
The issue is with properly encoded UTF-8 data without the internal UTF-8 flag set: when passing such data to Dancer2, it ends up doubly encoded. This might make some kind of sense, but it runs contrary to what Perl's core and other frameworks do.
It'd be nice to have a config option that means "please don't re-encode data behind my back, I know what I'm doing". Right now the only workaround I've found is to manually call Encode::_utf8_on on the affected strings, which is ugly.
Apologies that the documentation for the charset option doesn't match the implementation. (My suspicion is that the docs were copied from D1 but never updated.)
"Best practice" is to use character strings (not bytes/octets) within a modern app. For a modern perl web app, using UTF-8 characters internally is the only real choice (that's my opinion, and TIMTOWTDI applies, of course).
That implies decoding all inputs to characters, and encoding the output to byte strings (a response must be bytes) at the app boundaries. You can't rely on the internal utf8 flag either, as it's possible to have properly encoded utf8 character strings without the internal utf8 flag set. eg: perl -CO -E'$str = "\xe9\xe9"; say utf8::is_utf8($str); say $str';
One must understand what their incoming data is.
The content keyword assumes you are adding utf8 characters to the response body. If you are getting data from some datastore that is bytes and you are not modifying it before returning it from the response, use the send_file keyword with an inline filehandle. Otherwise, ensure that data is decoded to characters as it enters your app, using Encode::decode_utf8 or similar, rather than toggling the internal flag, or look at setting the is_encoded flag in the response object before using the content keyword. YMMV.
Regarding the YAML serializer: YAML.pm (which we use to keep the core fat-packable) doesn't adhere to several parts of the YAML spec, character encodings being one example, which makes the decoding/encoding calls a necessary workaround.
Just ran into this. It's been 4 years already but the bug is still there. We got double UTF-8 encoding when migrating our API to Dancer2, which broke integrations for a lot of customers. While the docs say that when the charset param is not set it does nothing, it actually encodes data as UTF-8.
I confirm this problem too. I spent several hours trying to solve this double UTF-8 encoding bug. Finally found this issue on GitHub and the workaround with the Encode::_utf8_on function. It helped in my case.
I am also facing this issue. Seems like it is not yet fixed!