play-ws icon indicating copy to clipboard operation
play-ws copied to clipboard

json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset

Open maxcom opened this issue 7 years ago • 18 comments

JSON in HTTP are always encoded in UTF-8. Responses are parsed correctly when server writes content type header like application/json; charset=utf-8.

However, many servers (like Play framework itself) uses application/json without charset. In Play 2.6, that responses are parsed in ISO-8859-1 charset.

I think that problem is in JsonBodyReadables implementation. Json.parse(response.body) should be replaced with Json.parse(response.bodyAsBytes.utf8string). Charset for response.body is always ISO-8859-1 in async http client when no charset info is provided in content type header.

maxcom avatar Jun 23 '17 15:06 maxcom

Good catch.

wsargent avatar Jun 23 '17 18:06 wsargent

Technically, JSON is not always UTF-8, but it is always Unicode, and application/json does not define a charset parameter. I would use Json.parse(response.bodyAsBytes.toArray) directly, since it will detect the encoding.

gmethvin avatar Jun 23 '17 19:06 gmethvin

I've run into the same issue, where response.body does not correctly handle an emoji character. I'm not using play json to map the response body (manual jackson with scala module). Is the appropriate approach to use response.bodyAsBytes.toArray to get the raw bytes, which I can pass to Jackson, and use new String(response.bodyAsBytes.toArray) to get something for a log statement? I'm guessing that response.body is causing play to parse to JSON and then output to String, which is causing the encoding issue.

jsw avatar Jun 27 '17 19:06 jsw

Is there an ETA on a release that includes this fix?

Malax avatar Jul 04 '17 12:07 Malax

@Malax within the next couple of weeks -- however, do note that due to the way that the standalone client works, it's fairly easy to create your own custom body readables and writables as a workaround.

wsargent avatar Jul 04 '17 23:07 wsargent

@wsargent Thanks for the reply! 👍 If it's in the next couple of weeks, we will just delay the migration to 2.6 until then.

Malax avatar Jul 05 '17 08:07 Malax

@wsargent Can you comment on my workaround above for accessing the raw bytes when I don't want to use the play JSON parser?

jsw avatar Jul 05 '17 16:07 jsw

@jsw You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

wsargent avatar Jul 05 '17 18:07 wsargent

You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

avdv avatar Jul 06 '17 06:07 avdv

Unless I'm doing things wrong, I think the issue still exists in play-ws 1.0.1.

import org.scalatestplus.play.PlaySpec
import play.api.mvc.Action
import play.api.mvc.Results.Ok
import play.api.test.Helpers._
import play.api.test.WsTestClient
import play.core.server.Server
import scala.concurrent.ExecutionContext.Implicits.global

class JsonEncodingTest extends PlaySpec {
  "ws" should {
    "handle utf8" in {
      Server.withRouter() {
        case _  => Action { Ok("""{"foo": "👍"}""").as(JSON) }
      } { implicit port =>
        WsTestClient.withClient { client =>
          println(await(client.url("/").get().map(_.body.toString))) // unexpected output
          println(await(client.url("/").get().map(r => new String(r.bodyAsBytes.toArray)))) // expected output
        }
      }
    }
  }
}

Output

{"foo": "👍"}
{"foo": "👍"}

jsw avatar Jul 07 '17 23:07 jsw

You should be calling

client.url("/").get().map(r => Json.parse(r.bodyAsBytes.toArray))

wsargent avatar Jul 08 '17 05:07 wsargent

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

In theory yes, in practice array copying usually doesn't yield much in the way of performance improvements because the JVM is pretty effective at array copying:

https://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html

If you're interested, you can poke at the JVM overhead with JMH:

https://psy-lob-saw.blogspot.com/p/jmh-related-posts.html

There's a sbt plugin at https://github.com/ktoso/sbt-jmh that will help.

wsargent avatar Jul 08 '17 05:07 wsargent

@wsargent My use cases are:

  • obtain a String or Array[Byte] to pass to Jackson mapper (I'm not using play-json)
  • obtain a String suitable for logging the response body

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

jsw avatar Jul 08 '17 18:07 jsw

@wsargent Thanks for that valuable information. :+1:

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Json.parse detects the encoding of the data and decodes them appropriately.

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

r.body decodes the data using the iso-8859-1 encoding which is the default behaviour of the AsyncHttpClient, AFAIK.

new String(...) uses the default platform encoding to decode the data, which is probably UTF-8 on Linux, CP1250 on Windows, et cetera.

avdv avatar Jul 10 '17 07:07 avdv

The remaining problem being discussed here is basically that the body: String method is not smart enough to understand the rules of every content type, and I'm not sure if it should be. There are a few options here:

  • Restore the old behavior in which we read the charset parameter (regardless of whether the media type defines one) and default to UTF-8 for non text content types and ISO-8859-1 for text. That would cover a lot of cases, but it would still be broken for JSON sent as UTF-16, and virtually any other content type that allows charsets to be declared in another way (e.g. HTML can have charsets embedded in meta tags).
  • Instead of having a default implicit BodyReadable[String], we can have multiple String BodyReadables for specific cases, including one for parsing a JSON body to a string. That could use the logic here for charset detection: https://github.com/playframework/play-ws/pull/151/files/9155ef716b3c1d5699783e2aa2edf1c8b8260970#diff-0b357986f490a94794eef6c192426fd9R29

gmethvin avatar Aug 14 '17 22:08 gmethvin

The issue still exists in play-ws 1.1.1. Another possible solution would be to change the body type to ByteString :speak_no_evil:

tartakynov avatar Sep 19 '17 16:09 tartakynov

somewhat related... how can we make the server include the charset instead of trying to be too smart for its own good?

fommil avatar Feb 16 '18 17:02 fommil

See possible fix on the Play server side: https://github.com/playframework/playframework/issues/8239

richdougherty avatar Feb 20 '18 11:02 richdougherty