play-ws
play-ws copied to clipboard
json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset
JSON in HTTP are always encoded in UTF-8. Responses are parsed correctly when server writes content type header like application/json; charset=utf-8
.
However, many servers (like Play framework itself) uses application/json
without charset. In Play 2.6, that responses are parsed in ISO-8859-1 charset.
I think that problem is in JsonBodyReadables
implementation. Json.parse(response.body)
should be replaced with Json.parse(response.bodyAsBytes.utf8string)
. Charset for response.body
is always ISO-8859-1 in async http client when no charset info is provided in content type header.
Good catch.
Technically, JSON is not always UTF-8, but it is always Unicode, and application/json
does not define a charset
parameter. I would use Json.parse(response.bodyAsBytes.toArray)
directly, since it will detect the encoding.
I've run into the same issue, where response.body
does not correctly handle an emoji character. I'm not using play json to map the response body (manual jackson with scala module). Is the appropriate approach to use response.bodyAsBytes.toArray
to get the raw bytes, which I can pass to Jackson, and use new String(response.bodyAsBytes.toArray)
to get something for a log statement? I'm guessing that response.body
is causing play to parse to JSON and then output to String
, which is causing the encoding issue.
Is there an ETA on a release that includes this fix?
@Malax within the next couple of weeks -- however, do note that due to the way that the standalone client works, it's fairly easy to create your own custom body readables and writables as a workaround.
@wsargent Thanks for the reply! 👍 If it's in the next couple of weeks, we will just delay the migration to 2.6 until then.
@wsargent Can you comment on my workaround above for accessing the raw bytes when I don't want to use the play JSON parser?
@jsw You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray)
which does UTF-8/UTF-16 encoding detection correctly.
You are broadly correct -- the best option is to use
Json.parse(response.bodyAsBytes.toArray)
which does UTF-8/UTF-16 encoding detection correctly.
Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!
Unless I'm doing things wrong, I think the issue still exists in play-ws 1.0.1.
import org.scalatestplus.play.PlaySpec
import play.api.mvc.Action
import play.api.mvc.Results.Ok
import play.api.test.Helpers._
import play.api.test.WsTestClient
import play.core.server.Server
import scala.concurrent.ExecutionContext.Implicits.global
class JsonEncodingTest extends PlaySpec {
"ws" should {
"handle utf8" in {
Server.withRouter() {
case _ => Action { Ok("""{"foo": "👍"}""").as(JSON) }
} { implicit port =>
WsTestClient.withClient { client =>
println(await(client.url("/").get().map(_.body.toString))) // unexpected output
println(await(client.url("/").get().map(r => new String(r.bodyAsBytes.toArray)))) // expected output
}
}
}
}
}
Output
{"foo": "ð"}
{"foo": "👍"}
You should be calling
client.url("/").get().map(r => Json.parse(r.bodyAsBytes.toArray))
Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!
In theory yes, in practice array copying usually doesn't yield much in the way of performance improvements because the JVM is pretty effective at array copying:
https://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html
If you're interested, you can poke at the JVM overhead with JMH:
https://psy-lob-saw.blogspot.com/p/jmh-related-posts.html
There's a sbt plugin at https://github.com/ktoso/sbt-jmh that will help.
@wsargent My use cases are:
- obtain a
String
orArray[Byte]
to pass to Jackson mapper (I'm not using play-json) - obtain a
String
suitable for logging the response body
Given that, I'm not sure why I would want to do an extra parse via Json.parse
and then convert that back to a String
. Can you elaborate?
Also, can you explain why r.body.toString
produces different output than new String(r.bodyAsBytes.toArray)
?
@wsargent Thanks for that valuable information. :+1:
Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?
Json.parse detects the encoding of the data and decodes them appropriately.
Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?
r.body
decodes the data using the iso-8859-1 encoding which is the default behaviour of the AsyncHttpClient, AFAIK.
new String(...)
uses the default platform encoding to decode the data, which is probably UTF-8 on Linux, CP1250 on Windows, et cetera.
The remaining problem being discussed here is basically that the body: String
method is not smart enough to understand the rules of every content type, and I'm not sure if it should be. There are a few options here:
- Restore the old behavior in which we read the charset parameter (regardless of whether the media type defines one) and default to UTF-8 for non
text
content types and ISO-8859-1 fortext
. That would cover a lot of cases, but it would still be broken for JSON sent as UTF-16, and virtually any other content type that allows charsets to be declared in another way (e.g. HTML can have charsets embedded in meta tags). - Instead of having a default implicit
BodyReadable[String]
, we can have multiple String BodyReadables for specific cases, including one for parsing a JSON body to a string. That could use the logic here for charset detection: https://github.com/playframework/play-ws/pull/151/files/9155ef716b3c1d5699783e2aa2edf1c8b8260970#diff-0b357986f490a94794eef6c192426fd9R29
The issue still exists in play-ws 1.1.1. Another possible solution would be to change the body
type to ByteString :speak_no_evil:
somewhat related... how can we make the server include the charset instead of trying to be too smart for its own good?
See possible fix on the Play server side: https://github.com/playframework/playframework/issues/8239