
JAXRS charset not set in response

Open nadiramra opened this issue 4 years ago • 7 comments

If the application does not set the Content-Type itself and relies on the JAX-RS annotations, shouldn't the server know, via some configuration switch, that it needs to add the charset parameter? Our JVM is started with -Dfile.encoding=UTF-8. You would think the server would key off of that and add charset=UTF-8 if one has not already been set. By default, if you do not specify a charset, the data is assumed to be ISO-8859-1.

nadiramra avatar Jun 09 '21 20:06 nadiramra

AFAIK, the spec does not mention that the charset should be set, but looking through some test cases in Open Liberty, I do see a difference between CXF and RESTEasy. By default, CXF returns content types of "text/xml", "application/xml", "application/json", while RESTEasy returns those same content types with ";charset=UTF-8" appended. I'm not sure yet where RESTEasy gets that value.

Still, I think it would be useful for the spec to define whether an app should globally add the charset to the Content-Type header, and what the global default should be. Using -Dfile.encoding seems reasonable to me.

andymc12 avatar Jun 14 '21 20:06 andymc12

Given that the default is ISO-8859-1, and every character it can represent can also be encoded in UTF-8, I suppose you cannot go wrong specifying UTF-8. Both encode ASCII the same way (only bytes above 0x7F differ).

It is interesting to note that when invoking SOAP it does add the charset=UTF-8, here is what Liberty soap service returns:

Content-Type: text/xml; charset=UTF-8

and this is independent of what -Dfile.encoding is set to.

Also, according to the json standard, https://datatracker.ietf.org/doc/rfc8259/ :

_JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability._

So I think it would be appropriate to add the charset=UTF-8 when returning JSON if and only if charset was not specified.

nadiramra avatar Jun 14 '21 21:06 nadiramra

I think such a default can be added pretty simply by a custom-made writer interceptor, so I do not see a need to amend the spec.
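To make that concrete, here is a minimal sketch of the header logic such an interceptor could apply. It is plain Java so it runs without the Jakarta REST API on the classpath; in a real application the same check would presumably sit inside a `WriterInterceptor#aroundWriteTo` and operate on the Content-Type entry of `context.getHeaders()`. The class name `CharsetDefaulter`, the method `withDefaultCharset`, and the rule for which media types count as textual are all assumptions made for illustration, not anything defined by the spec or by an implementation.

```java
import java.util.Locale;

public class CharsetDefaulter {

    // Hypothetical helper: returns the Content-Type value unchanged if it
    // already carries a charset parameter or is not a textual type,
    // otherwise appends charset=UTF-8.
    public static String withDefaultCharset(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return contentType;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        int semi = lower.indexOf(';');
        String type = (semi >= 0 ? lower.substring(0, semi) : lower).trim();
        String params = semi >= 0 ? lower.substring(semi + 1) : "";

        // Assumption: only text/*, *json and *xml types get a default charset.
        boolean textual = type.startsWith("text/")
                || type.endsWith("json")
                || type.endsWith("xml");
        if (!textual) {
            return contentType;
        }

        // Respect a charset the application has already set.
        for (String param : params.split(";")) {
            if (param.trim().startsWith("charset=")) {
                return contentType;
            }
        }
        return contentType + ";charset=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(withDefaultCharset("application/json"));
        System.out.println(withDefaultCharset("text/xml; charset=ISO-8859-1"));
        System.out.println(withDefaultCharset("image/png"));
    }
}
```

The helper only ever adds a charset, never replaces one, which matches the "if and only if charset was not specified" rule proposed earlier in the thread.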

mkarg avatar Jun 15 '21 06:06 mkarg

Yes, I know one can do so, but why should one have to worry about that? Like I said, JAX-WS adds the charset automatically. Why should we make users do something that they really should not have to do? And again, by the standards, JSON is assumed to be in UTF-8. But right now, it is ambiguous to say the least. By HTTP standards, if no charset is given, the data is assumed to be ISO-8859-1. Basically we need to do the right thing. Not changing the spec. Changing the implementation so that it follows the spec.

nadiramra avatar Jun 15 '21 08:06 nadiramra

The use of charsets has been loosely defined for some time now. A JEP about using a different default charset on the platform has been open for a while (https://bugs.openjdk.java.net/browse/JDK-8187041). If we can find a reasonable rule or suggestion to add to the Jakarta REST spec, I'm in favor of doing so.

spericas avatar Jun 15 '21 13:06 spericas

I cannot see a rationale. Jakarta REST is not bound to a specific content type, neither JSON nor any other, and it should concentrate on foundational functionality that users cannot easily bring into play themselves. Topics like this one should be covered by a "Commons" library. I could imagine that we establish such a library either at the EF or at Eclipse, but it should definitely not be part of Jakarta REST.

mkarg avatar Jun 15 '21 17:06 mkarg

All I am saying is that IF a user specifies a JSON content type (or, for that matter, XML) and does not specify a charset, then charset=UTF-8 should be added. But I am not going to push it. The reason this is an issue at all is that a client assumed the data was ISO-8859-1 (the default) since no charset was set, but the payload actually contained national language characters that are not part of ISO-8859-1, and when conversion was attempted they got garbage.
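For anyone who wants to reproduce the failure mode, a small stand-alone sketch: it encodes a string as UTF-8 (standing in for the response body) and then decodes it the way a client would if it fell back to ISO-8859-1. The class name `CharsetMismatchDemo`, the helper `decodeAs`, and the sample text are made up for illustration; this is not the actual payload from our case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode a response body using the charset the
    // client assumes the server used.
    public static String decodeAs(byte[] body, Charset assumed) {
        return new String(body, assumed);
    }

    public static void main(String[] args) {
        // "café €": the euro sign has no code point in ISO-8859-1.
        String original = "caf\u00e9 \u20ac";

        // What the server puts on the wire when it writes UTF-8.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(wire, StandardCharsets.UTF_8);
        String wrong = decodeAs(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false

        // Each UTF-8 byte becomes one character under ISO-8859-1,
        // so the wrongly decoded text is longer than the original.
        System.out.println(wrong.length() + " chars instead of " + original.length());

        // Pure ASCII survives either way, which is why the problem
        // often goes unnoticed until non-ASCII data shows up.
        byte[] ascii = "plain".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAs(ascii, StandardCharsets.ISO_8859_1).equals("plain")); // true
    }
}
```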

nadiramra avatar Jun 15 '21 20:06 nadiramra