application/json default charset is not UTF-8 when parsing request
This is a follow-up to #80. Apparently, responses were fixed, but there are still issues with requests.
A request with Content-Type: application/json is not decoded as UTF-8, while a request with Content-Type: application/json; charset=utf-8 is decoded correctly. Both should behave the same and be decoded as UTF-8.
It should work properly if you have Gson or Jackson support installed and are receiving a serializable type; in that case content negotiation and the custom converter handle it well. However, if you are receiving a string (or using receiveText()), then charsets are always handled according to plain-text rules, and the default for plain text is ISO-8859-1.
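A minimal sketch of the two paths described above, assuming a ktor 1.x setup with the Jackson converter; the data class, port, and route paths are illustrative, not taken from this thread:

import io.ktor.application.*
import io.ktor.features.ContentNegotiation
import io.ktor.jackson.jackson
import io.ktor.request.receive
import io.ktor.request.receiveText
import io.ktor.response.respondText
import io.ktor.routing.*
import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty

data class Payload(val a: String)

fun main() {
    embeddedServer(Netty, port = 8080) {
        install(ContentNegotiation) { jackson { } }
        routing {
            post("/typed") {
                // Goes through the content converter, which assumes UTF-8 for JSON.
                val payload = call.receive<Payload>()
                call.respondText(payload.a)
            }
            post("/raw") {
                // Treated as plain text: the charset comes from the Content-Type header,
                // falling back to ISO-8859-1 when no charset parameter is present.
                val body = call.receiveText()
                call.respondText(body)
            }
        }
    }.start(wait = true)
}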
It doesn't work properly; I tried both. Parsing works, however the resulting string values (in the deserialized object) are not correct: they don't look right when printed. I assume they're in ISO-8859-1, but I didn't check.
Just added tests: both Gson and Jackson handle encodings properly; however, neither covers receiveText and receive<String>.
I was using receive<MyClass> syntax, where MyClass has String fields.
So, is it supposed to work?
I definitely expected it to just work. I mean, I expected Content-Type: application/json to be enough here for deserialization to switch to proper JSON handling (with UTF-8 instead of ISO-8859-1).
receive<MyClass> should work. I can't reproduce it, could you please provide an example?
Sure, but not now.
Are you using ktor-client or server-side only ktor?
Server-side only. I use that deserialized object with Apache PDFBox later on to fill in the forms in PDFs.
Any progress?
This is an extract from my code:
embeddedServer(Netty, port = getPort()) {
    install(DefaultHeaders)
    install(CallLogging)
    install(CORS) {
        anyHost()
    }
    install(ContentNegotiation) {
        jackson {
            registerKotlinModule()
        }
    }
    routing {
Do I use ContentNegotiation and jackson properly?
Actually, this is my whole main fn (simplified):
fun main(args: Array<String>) {
    val pdfContentType = ContentType("application", "pdf")
    embeddedServer(Netty, port = getPort()) {
        install(DefaultHeaders)
        install(CallLogging)
        install(CORS) {
            anyHost()
        }
        install(ContentNegotiation) {
            register(ContentType.Application.Json, JacksonConverter())
        }
        routing {
            post("/") {
                val task = call.receive<Task>()
                worker.load().use {
                    execTask(it, task)
                    val stream = ByteArrayOutputStream()
                    it.save(stream)
                    call.respond(Content(
                        bytes = stream.toByteArray(),
                        contentType = pdfContentType
                    ))
                }
            }
        }
    }.start(wait = true)
}
The error I have occurs deep in execTask, and the issue there is that Task (a data-class object) has strings that are not decoded as UTF-8 (if I pass application/json as the Content-Type).
However, when I pass application/json; charset=utf-8 as a Content-Type, the resulting Task has strings in valid UTF-8 encoding (no error occurs).
Note that the resulting strings with incorrect encoding seem to be byte-for-byte the same - they're just being interpreted incorrectly. This applies to multi-byte characters (Cyrillic symbols in my case).
Effectively I get the following error (it's not from ktor, but it might help understand what the issue with encoding is): java.lang.IllegalArgumentException: No glyph for U+0081 in font LiberationSans.
The character that causes an error in this case should be U+0441 (https://www.compart.com/en/unicode/U+0441). It was supposed to be in the string (instead of, I guess, U+00D1 U+0081).
Sample request body: {"a":"тест"}
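To illustrate what happens to that sample body, here is a small standalone sketch (not part of the project discussed here) that decodes the UTF-8 bytes of "тест" with the wrong charset:

fun main() {
    // UTF-8 bytes of the sample value "тест": D1 82 D0 B5 D1 81 D1 82
    val utf8Bytes = "тест".toByteArray(Charsets.UTF_8)

    // Decoding those bytes with the plain-text default charset produces mojibake:
    // 'с' (U+0441) turns into the two characters U+00D1 U+0081, which is exactly
    // the glyph PDFBox later fails to find.
    val wrong = String(utf8Bytes, Charsets.ISO_8859_1)
    println(wrong.map { "U+%04X".format(it.code) })

    // Decoding with UTF-8 restores the original string.
    println(String(utf8Bytes, Charsets.UTF_8)) // тест
}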
I can confirm. I have the same issue with GitHub webhooks. GitHub doesn't set an encoding for the application/json payload (by the standard, JSON is UTF-8, so maybe that's the reason). I use call.receiveText() to read the payload and check the signature, but Ktor reads the payload in the wrong encoding: … becomes â¦
I use CIO and Gson (but do not read request using it, only for responses).
Ktor version 0.9.3
I'm still trying to find a reproducible example; it looks like there are other factors involved. Still, using a blocking call.receiveStream() temporarily solved my problem, but I will try to report what causes this issue.
I could reproduce this issue, please check this sample project: https://github.com/gildor/ktor-encoding-bug
Oh, I just got what @cy6erGn0m means in his comment. So, it's expected behavior. But it still doesn't sound good or user-friendly, even if it follows the HTTP standard. JSON, per the RFC, must be encoded in UTF-8. Maybe it makes sense to open another issue and reconsider the current behavior. Even for non-JSON content types, I can hardly imagine ISO-8859-1 as the default encoding on the modern web; in most cases it will just cause bugs.
@gildor what makes you think it's expected behavior? I think there are tests that check that JSON is parsed as UTF-8, but they don't handle this case for some reason.
@MOZGIII According to this comment, this is expected behavior for my case: reading an application/json body as a String using call.receiveText.
Oh, got it. That's not the case for me though, since I use call.receive<T>().
I think the default behavior is incorrect. Either we should not call receiveText for JSON input (meaning we should have a proper alternative), or all the calls (i.e. receiveText) should have correct default behavior. I think so because I can affect the actual encoding used if I pass charset=utf-8, and then things would be parsed properly, so my thinking is that having sane defaults is actually very important here. And it's not something you would want to abstract from.
An alternative is a better API that would allow me to specify the encoding I want to parse with. In certain cases it's easier for me to say what encoding I want than for the system to deduce it - especially if the deduction is broken.
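One possible shape for such an API is sketched below; this is purely hypothetical and not an existing ktor function:

// Hypothetical helper, not part of ktor: receive the body decoded with an
// explicitly chosen charset, ignoring whatever the request headers say.
suspend fun ApplicationCall.receiveText(charset: Charset): String =
    receiveStream().bufferedReader(charset).readText()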
@MOZGIII
since I use call.receive<T>()
Actually, receiveText also uses call.receive<String>(); in the case of Gson/Jackson there is different behavior based on the assumption of UTF-8 encoding for JSON.
Yes, I completely agree with you. I solved my problem by reading the body as a ByteArray and converting it to a String with the encoding specified explicitly (constructing a String from a ByteArray defaults to UTF-8 anyway), but the current behavior is very error-prone and it's sometimes hard to understand what went wrong: in my case the signature check failed sporadically for different events (the ones with multi-byte UTF-8 characters).
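For reference, a minimal sketch of that workaround on the server side, assuming ktor 1.x (the route path and response are illustrative):

post("/webhook") {
    // Receive the raw bytes and decode them explicitly, bypassing the
    // plain-text default charset entirely.
    val bytes = call.receive<ByteArray>()
    val payload = String(bytes, Charsets.UTF_8)
    // ... verify the signature / parse the payload here ...
    call.respondText("ok")
}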
Use this function if you just want to "have it work":
/**
 * Receive the request as a String.
 * If no charset is specified in the Content-Type header, use ISO_8859_1 as the default charset,
 * see https://www.w3.org/International/articles/http-charset/index#charset.
 * But use UTF-8 as the default charset for application/json, see https://tools.ietf.org/html/rfc4627#section-3
 */
private suspend fun ApplicationCall.receiveTextWithCorrectEncoding(): String {
    fun ContentType.defaultCharset(): Charset = when (this) {
        ContentType.Application.Json -> Charsets.UTF_8
        else -> Charsets.ISO_8859_1
    }

    val contentType = request.contentType()
    val suitableCharset = contentType.charset() ?: contentType.defaultCharset()
    return receiveStream().bufferedReader(charset = suitableCharset).readText()
}
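Usage inside a route then looks like this (the route path is illustrative):

post("/hook") {
    val body = call.receiveTextWithCorrectEncoding()
    call.respondText("received ${body.length} characters")
}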
@cy6erGn0m Maybe you want to change the default implementation of receiveText to the implementation of receiveTextWithCorrectEncoding()?
@HttpClientDsl
open class HttpClientEngineConfig {
    /**
     * The [CoroutineDispatcher] that will be used for the client requests.
     */
    @Deprecated(
        "Custom dispatcher is deprecated. Consider using threadsCount instead.",
        level = DeprecationLevel.ERROR
    )
    var dispatcher: CoroutineDispatcher? = null

    /**
     * Network threads count
     */
    @KtorExperimentalAPI
    var threadsCount: Int = 4

    /**
     * Enable http pipelining
     */
    var pipelining: Boolean = false

    /**
     * Configuration for http response.
     */
    val response: HttpResponseConfig = HttpResponseConfig()
}
@cy6erGn0m Can you change response from val to var? Then we could customize the HttpResponseConfig and Charset; otherwise the val HttpResponseConfig in HttpClientEngineConfig is meaningless. Thanks.
Looks like #1008 has solved this issue upstream.
I was also wondering about the val response like @zhuosun-rally, but I solved it with:
HttpClient(Apache) {
    engine {
        response.apply {
            defaultCharset = Charsets.UTF_8
        }
    }
    install(JsonFeature) {
        serializer = JacksonSerializer()
    }
}
I'll check if my issue is resolved later today and post an update here.
Looks like #1008 has solved this issue upstream. I was also wondering about the val response like @zhuosun-rally, but I solved it with: HttpClient(Apache) { engine { response.apply { defaultCharset = Charsets.UTF_8 } } install(JsonFeature) { serializer = JacksonSerializer() } }
I tried this way, but it does not work on my side; the value gets overwritten. In the end, I updated the HttpClientEngine to make it work:
val okHttpEngine = OkHttp.create {
    // do something
}.apply {
    config.response.defaultCharset = Charset.defaultCharset()
}

HttpClient(okHttpEngine) {
    install(SomeFeature) {
    }
}
I'm thinking it would be easier if HttpResponseConfig were a var ^^.
I updated my code to use ktor 1.1.3, and it seems like the issue does not appear anymore. That said, I'm not sure if it's fixed or if I forgot how to reproduce it. I'll do more testing in the next few days.
Also, it seems like my initial issue had nothing to do with the HTTP client, so I'm not sure the last few comments are related to what I encountered initially. If that's the case, I'd recommend opening another issue.
I have run into this problem as well. The receiveTextWithCorrectEncoding() workaround solved it for now, but it seems like this is still an issue.
Same here, but in my case I was having issues with content received from Google's PubSubHubbub notification hub server
Please check the following ticket on YouTrack for follow-ups to this issue. GitHub issues will be closed in the coming weeks.