ktor icon indicating copy to clipboard operation
ktor copied to clipboard

application/json default charset is not UTF-8 when parsing request

Open MOZGIII opened this issue 7 years ago • 30 comments

This is a follow up to #80. Apparently, responses were fixed, but there are still issues with the requests.

Request with Content-Type: application/json is decoded not as UTF-8, while request with Content-Type: application/json; charset=utf-8 is decoded correctly. Both has to behave the same and be decoded as UTF-8.

MOZGIII avatar Apr 09 '18 08:04 MOZGIII

It should work properly if you have Gson or Jackson support installed and receiving serializable type so in this case content negotiation and custom converter does handling well. However if you are receiving a string (or receiveText()) then charsets are always handled according to plain text rules and the default for plain text is ISO-8859-1

cy6erGn0m avatar Apr 16 '18 09:04 cy6erGn0m

It doesn't work properly, I tried both. Parsing works, however the resulting string values (in a deserialized object) are not correct - they're not looking right when printed. I assume they're in ISO-8859-1, but I didn't check it.

MOZGIII avatar Apr 16 '18 09:04 MOZGIII

Just added tests: both Gson and Jackson handle encodings properly however both do not cover receiveText and receive<String>

cy6erGn0m avatar Apr 16 '18 09:04 cy6erGn0m

I was using receive<MyClass> syntax, where MyClass has String fields. So, is it supposed to work?

MOZGIII avatar Apr 16 '18 10:04 MOZGIII

I definitely expected for it to just work. I mean, I expected Content-Type: application/json to be enough here for deserialization to switch to proper JSON (with UTF-8 instead of ISO-8859-1).

MOZGIII avatar Apr 16 '18 10:04 MOZGIII

revceive<MyClass> should work, I can't reproduce it, could you please provide example?

cy6erGn0m avatar Apr 16 '18 10:04 cy6erGn0m

Sure, but not now.

MOZGIII avatar Apr 16 '18 10:04 MOZGIII

Are you using ktor-client or server-side only ktor?

cy6erGn0m avatar Apr 16 '18 10:04 cy6erGn0m

Server-side only. I use that deserialized object with Apache PDFBox later on to fill in the forms in PDFs.

MOZGIII avatar Apr 16 '18 10:04 MOZGIII

Any progress?

MOZGIII avatar Jul 03 '18 10:07 MOZGIII

This is an extract from my code:

    embeddedServer(Netty, port = getPort()) {
        install(DefaultHeaders)
        install(CallLogging)
        install(CORS) {
            anyHost()
        }
        install(ContentNegotiation) {
            jackson {
                registerKotlinModule()
            }
        }
        routing {

Do I use ContentNegotiation and jackson properly?

MOZGIII avatar Jul 03 '18 10:07 MOZGIII

Actually, this is my whole main fn (simplified):

fun main(args: Array<String>) {
    val pdfContentType = ContentType("application", "pdf")

    embeddedServer(Netty, port = getPort()) {
        install(DefaultHeaders)
        install(CallLogging)
        install(CORS) {
            anyHost()
        }
        install(ContentNegotiation) {
            register(ContentType.Application.Json, JacksonConverter())
        }
        routing {
            post("/") {
                val task = call.receive<Task>()

                worker.load().use {
                    execTask(it, task)

                    val stream = ByteArrayOutputStream()
                    it.save(stream)

                    call.respond(Content(
                        bytes = stream.toByteArray(),
                        contentType = pdfContentType
                    ))
                }
            }
        }
    }.start(wait = true)
}

The error I have occurs deep in the execTask, and the issue there is that Task (a data-class object) have stings that are not encoded in the UTF-8 (if I pass application/json as a Content-Type). However, when I pass application/json; charset=utf-8 as a Content-Type, the resulting Task has strings in valid UTF-8 encoding (no error occurs).

Note that the resulting strings with incorrect encoding seem to be binary the same - they're just getting interpreted incorrectly. This applies to multi-byte characters (Cyrillic symbols in my case).

Effectively I get the following error (it's not from ktor, but it might help understand what the issue with encoding is): java.lang.IllegalArgumentException: No glyph for U+0081 in font LiberationSans. The character that causes an error in this case should be U+0441 (https://www.compart.com/en/unicode/U+0441). It was supposed to be in the string (instead of, I guess, U+00D1 U+0081).

MOZGIII avatar Jul 03 '18 11:07 MOZGIII

Sample request body: {"a":"тест"}

MOZGIII avatar Jul 03 '18 11:07 MOZGIII

I can confirm. I have the same issue with Github webhooks. Github doesn't set encoding for application/json payload (by standard json is UTF8, so maybe this is the reason of that) I use call.receiveText() to read payload and check the signature, but Ktor reads payload in wrong encoding: becomes … I use CIO and Gson (but do not read request using it, only for responses). Ktor version 0.9.3

gildor avatar Aug 10 '18 04:08 gildor

I still trying to find a reproducible example, looks that it's some other factors involved. Still, using a blocking call.receiveStream() temporary solved my problem, but I will try to report what cause this issue

gildor avatar Aug 10 '18 05:08 gildor

I could reproduce this issue, please check this sample project: https://github.com/gildor/ktor-encoding-bug

gildor avatar Aug 10 '18 06:08 gildor

Oh, I just got what @cy6erGn0m means in his comment. So, it's expected behavior. But still doesn't sound like a good and user-friendly even if it follows HTTP standard. JSON by RFC standard is must be encoded in UTF-8. Maybe make sense to open another issue and reconsider current behavior. Even for non-json content types, I hardly can imagine ISO-8859-1 as default encoding in Modern Web, and in most cases it will just cause bugs.

gildor avatar Aug 10 '18 07:08 gildor

@gildor what makes you think it's expected behavior? I think there are tests that try to test json to be parsed into UTF-8, but they don't handle this case for some reason.

MOZGIII avatar Aug 11 '18 11:08 MOZGIII

@MOZGIII According to this comment, this is expected behavior for my case: read application/json body as String using call.receiveText

gildor avatar Aug 11 '18 11:08 gildor

Oh, got it. That's not the case for me though, since I use call.receive<T>(). I think the default behavior is incorrect. Either we should not call receiveText for JSON input (meaning we should have a proper alternative), or all the calls (i.e. receiveText) should have correct default behavior. I think so because I can affect the actual encoding used if I pass charset=utf-8, and then things would be parsed properly, so my thinking is that having sane defaults is actually very important here. And it's not something you would want to abstract from.

An alternative is a better API that would allow me to specify the encoding I want to parse with. In certain cases it's easier for me to tell what encoding I want, than for system to deduce it. Especially if it's broken.

MOZGIII avatar Aug 11 '18 11:08 MOZGIII

@MOZGIII

since I use call.receive<T>()

Actually, receiveText also uses call.receive<String>(), in case of Gson/Jackson there is different behavior based on assumption of UTF-8 encoding for Json.

Yes, I completely agree with you. I solved my problem by reading as ByteArray and converting it to String and specify the encoding explicitly (actually ByteArray.toString() by default uses UTF-8), but current behavior is very error-prone and sometimes hard to understand what is went wrong: in my case signature check failed sporadically for different events (for ones with UTF characters)

gildor avatar Aug 13 '18 01:08 gildor

Use this function if you just want to "have it work":

/**
 * Receive the request as String.
 * If there is no Content-Type in the HTTP header specified use ISO_8859_1 as default charset, see https://www.w3.org/International/articles/http-charset/index#charset.
 * But use UTF-8 as default charset for application/json, see https://tools.ietf.org/html/rfc4627#section-3
 */
private suspend fun ApplicationCall.receiveTextWithCorrectEncoding(): String {
  fun ContentType.defaultCharset(): Charset = when (this) {
    ContentType.Application.Json -> Charsets.UTF_8
    else -> Charsets.ISO_8859_1
  }
  
  val contentType = request.contentType()
  val suitableCharset = contentType.charset() ?: contentType.defaultCharset()
  return receiveStream().bufferedReader(charset = suitableCharset).readText()
}

@cy6erGn0m Maybe you want to change the default implementation of receiveText to the implementation of receiveTextWithCorrectEncoding()?

functionaldude avatar Jan 29 '19 13:01 functionaldude

@HttpClientDsl
open class HttpClientEngineConfig {
    /**
     * The [CoroutineDispatcher] that will be used for the client requests.
     */
    @Deprecated(
        "Custom dispatcher is deprecated. Consider using threadsCount instead.",
        level = DeprecationLevel.ERROR
    )
    var dispatcher: CoroutineDispatcher? = null

    /**
     * Network threads count
     */
    @KtorExperimentalAPI
    var threadsCount: Int = 4

    /**
     * Enable http pipelining
     */
    var pipelining: Boolean = false

    /**
     * Configuration for http response.
     */
    val response: HttpResponseConfig = HttpResponseConfig()
}

@cy6erGn0m Can you change the response from val to var? then we can custom the HttpResponseConfig and Charset, otherwise val HttpResponseConfig in HttpClientEngineConfig is meaningless, thanks

zhuosun-rally avatar Mar 19 '19 05:03 zhuosun-rally

Looks like #1008 has solved this issue upstream. 🙂 I was also wondering about the val response like @zhuosun-rally , but I solved with:

HttpClient(Apache) {
  engine {
    response.apply {
      defaultCharset = Charsets.UTF_8
    }
  }
  install(JsonFeature) {
    serializer = JacksonSerializer()
  }
}

jvassbo avatar Mar 21 '19 10:03 jvassbo

I'll check if my issue is resolved later today and post an update here.

MOZGIII avatar Mar 21 '19 13:03 MOZGIII

Looks like #1008 has solved this issue upstream. 🙂 I was also wondering about the val response like @zhuosun-rally , but I solved with:

HttpClient(Apache) {
  engine {
    response.apply {
      defaultCharset = Charsets.UTF_8
    }
  }
  install(JsonFeature) {
    serializer = JacksonSerializer()
  }
}

I tried this way, but it does not work in my side, it will be overwritten, at the end, I update the HttpClientEngine make it works

var okHttpEngine = OkHttp.create {
      'do something'
    }.apply {
      config.response.defaultCharset = Charset.defaultCharset()
    }

HttpClient(okHttpEngine) {
     install(SomeFeature) {
      }
}

I'm thinking it will be easier if HttpResponseConfig is var ^^.

zhuosun-rally avatar Mar 21 '19 17:03 zhuosun-rally

I updated my code to use ktor 1.1.3, and it seems like the issue does not appear anymore. That said, I'm not sure if it's fixed or if I forgot how to reproduce it. I'll do more testing in the next few days.

Also, seems like my initial issue had nothing to do with http client, not sure if the last few comments are related to what I encountered initially. If that's the case I'd recomment opening another issue.

MOZGIII avatar Mar 21 '19 20:03 MOZGIII

I have run into this problem as well. The receiveTextWithCorrectEncoding() work-around solved it for now, but it seems like it is still an issue.

clydebarrow avatar Feb 14 '20 19:02 clydebarrow

Same here, but in my case I was having issues with content received from Google's PubSubHubbub notification hub server

MrPowerGamerBR avatar Apr 15 '20 16:04 MrPowerGamerBR

Please check the following ticket on YouTrack for follow-ups to this issue. GitHub issues will be closed in the coming weeks.

oleg-larshin avatar Aug 10 '20 15:08 oleg-larshin