Simple-Web-Server
Character encoding in request body
Hi,
Thanks for creating this simple and fun to use web server :)
I have the following issue with character encoding in the request body: I send a JSON string in the request body. When the string contains characters that are not part of the original ASCII table (e.g. some French characters), they do not come through correctly.
For example, I am sending the following string to the server: "allécher quelqu'un". On the server side I am getting this: "allÃ©cher quelqu'un"
Any idea?
You're doing something wrong then. See http://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean. All JSON should be Unicode, usually UTF-8. It looks like your JSON library does not conform to the norm. Your client does indeed send Unicode (UTF-8 or UTF-16) while your server reads it as if it were WE8ISO8859P15 (or CP1252). I'm guessing your server is running Windows (Windows has worked in UTF-16 internally for a while, but it still runs processes in the local charset, CP1252 in your case; on most Linux distributions all processes work in UTF-8 by default).
Now, how to fix it is a whole other question.
PS: being French myself, I do enjoy the example you chose :D
If I understand your answer correctly - you suggest that the issue is with the sending side rather than the receiving side.
I believe I have narrowed the issue down to the receiving side (an issue with Simple-Web-Server, or at least something I didn't figure out how to configure). Here is what I did:
- Instead of using my app for sending the request I am using Postman.
- I made sure I am sending "Content-type: application/json; charset=utf-8"
- When I send the request to a Ruby server I have (instead of to the C++ Simple-Web-Server), the request comes in fine.
PS: I searched Google for "French Text" and this was in the first result. Not sure what it means :)
Sorry, it seems I explained myself badly. Let's try to clarify and dig into your issue.
Every process has a default charset, which depends on the OS and its configuration. Unless told otherwise, a process assumes that the data it is manipulating is in that default charset. Your client and server appear (more on that later) to have different default charsets. Simple-Web-Server doesn't translate between charsets (as far as I'm aware; eidheim will probably tell you more about this). All it does is give you the "advertised" charset of the data in the request. You'll need to do the conversion yourself (unless eidheim chooses to implement this in (t)his library). See GNU iconv or Boost.Locale.
Putting "Content-type: application/json; charset=utf-8" in the HTTP header doesn't actually mean the data itself is UTF-8. You'd better check that first before advertising that it is. If the server were able to do the translation for you and you were wrongly advertising the charset, the conversion would leave you in an even worse situation (say you send CP1252 characters while claiming they are UTF-8, and the server then tries to convert that data to CP1252 thinking it's UTF-8: every non-ASCII character will be an utter mess). Luckily for you, all that is happening is a missing conversion somewhere (or maybe a wrong conversion, but at this point I strongly doubt it).
Now a bunch of questions to help you dig into your issue:
- How did the text "allécher quelqu'un" end up in the client? (Maybe the issue arises even before the client <-> server exchange.)
  - Is it read from a text file at runtime, or written in the source? Then what is the charset of the file (on Linux, use "file <filename>" to find out)? What is the charset of your editor? Did you transfer the file using FTP text mode?
  - Are you using a terminal emulator (like PuTTY)? If so, what is its configured charset?
- How is the string being displayed on the server side? (Same sub-questions; maybe the issue is a wrong translation after the client <-> server exchange.)
Now that you've narrowed the issue down to somewhere between the client and the server, try to find who's guilty and who should do the conversion.
- What is the current charset of the client process (sending the data)? Is it actually UTF-8? If the original file is UTF-8 and the process is running in UTF-8, then the client is guiltless and is right to advertise UTF-8 data. Otherwise the client should do the conversion to UTF-8 so the server can safely assume the client is not lying.
- What is the current charset of the server process (receiving the data)? Is it actually UTF-8? If not, then the data has to be converted there.
My guess goes for that latter case, but it's just a guess...
I hope that wall of text helps a bit. Charset conversion is always a very complex matter, and one needs to be sure before converting data from one charset to another.
EDIT: back to the original reason of my posting here : compare https://translate.google.fr/#fr/en/all%C3%A9cher%20quelqu%27un (well I would probably translate it to "entice someone" myself) and https://translate.google.fr/#fr/en/%C3%A0%20l%C3%A9cher%20quelqu%27un ;)
EDIT2: rereading your reply... I could have made my wall of text way shorter. Your Ruby test shows that the server is indeed the one that should be doing the conversion. Simple-Web-Server doesn't do it for you, but it shouldn't either, as your JSON library probably expects UTF-8 data when reading the JSON. The conversion has to happen after the JSON parsing, based on what is advertised in the HTTP headers.
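To act on what the headers advertise, the server code first has to pull the charset out of the Content-Type value. Below is a minimal sketch of that step in plain C++. The helper name charset_from_content_type is hypothetical and not part of Simple-Web-Server; it assumes you have already obtained the header value as a std::string by whatever means your server code uses.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Simple-Web-Server): extract the charset
// parameter from a Content-Type header value and return it lower-cased.
// Returns an empty string when no charset parameter is present.
std::string charset_from_content_type(const std::string &header_value) {
  // Parameter names and charset names are case-insensitive, so normalise.
  std::string lower(header_value);
  std::transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });

  const std::string key = "charset=";
  std::size_t pos = lower.find(key);
  if (pos == std::string::npos)
    return "";
  pos += key.size();

  // The value ends at the next parameter separator or whitespace.
  const std::size_t end = lower.find_first_of("; \t", pos);
  std::string value =
      lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);

  // Strip optional double quotes, as in charset="utf-8".
  if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
    value = value.substr(1, value.size() - 2);
  return value;
}
```

With the header from this thread, charset_from_content_type("application/json; charset=utf-8") yields "utf-8". This is a simplified parser for illustration and does not cover every corner of the HTTP header grammar.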
Thanks for your detailed answer! You got me rolling... :) Here is the new stuff I learned:
- I read the text like this: request->content.string(); This returns std::string, and if I send "allécher quelqu'un" it reads "allÃ©cher quelqu'un".
- When I convert the result I get from request->content.string(); to std::wstring, it finally reads the string correctly!
Some questions I still have:
- This means that the server is getting UTF-16 and not UTF-8, right?
- Do I need to convert my entire code to use std::wstring instead of std::string (that's a big headache)?
- BUT, when I write the following code in my program:
std::string french_text = "allécher quelqu'un";
it looks fine in the debugger and doesn't change to "allÃ©cher quelqu'un". So this makes me think that std::string can handle these characters... What am I missing here???
P.S. I found the following translation of allécher quelqu'un: to give someone an appetite. Does it make sense? :)
P.P.S. I am using Windows (not Linux).
I know a thing or 2 about charsets because of my other (day-time) life (see my profile pic). I actually never had the issue while coding, so take my words with care. (And I never cared to code for the "old" OS :P)
Your server is receiving UTF-8 data. The character "é" is 2 bytes long when encoded in UTF-8, and "Ã" and "©" are what those 2 bytes look like in CP1252 (which encodes every character in a single byte). I guessed that you're using Windows on the server because on Linux it would have stayed "é", as the default charset there is UTF-8. Charsets on Windows are a bit more complex than on Linux ;) Let me have another guess: the client isn't on Windows :P
From what I understand of what you said, if you actually require any charset conversion, it would be from UTF-8 to CP1252 on your server side, so that the rendering of the string becomes correct. From my reading, std::wstring is based on wchar_t, which on Windows should be UTF-16. So there might still be a problem in your current approach. "é" takes 2 bytes in both UTF-8 (0xC3 0xA9) and UTF-16 (the single 16-bit unit 0x00E9), but they are not the same bytes, and many other characters differ even in length between these 2 encodings. (Half an hour ago I wasn't even aware of the existence of wstring, so beware :P)
It sounds like on Windows the default interpretation of a std::string is the local national charset, aka CP1252 in western countries, while a std::wstring is interpreted as an internal Unicode string which, as far as I know, is UTF-16. So previously you had UTF-8 read as if it were CP1252. Now, depending on how you did the conversion, you may have UTF-8 data read as if it were UTF-16 data. While this is better, it would still be damn wrong. You'll have to test with Chinese characters to be sure.
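The "é" becoming "Ã©" effect described above can be reproduced in a few lines. The sketch below is only an illustration: the function name misread_as_latin1 is made up for this example, and it treats each byte as Latin-1, which matches CP1252 for bytes 0xA0 to 0xFF (both bytes of "é" fall in that range) but not for 0x80 to 0x9F.

```cpp
#include <string>

// Illustration only: pretend every byte of `bytes` is one single-byte
// character (Latin-1, which agrees with CP1252 for 0xA0..0xFF) and re-encode
// the resulting characters as UTF-8. Feeding UTF-8 text through this
// reproduces the classic mojibake.
std::string misread_as_latin1(const std::string &bytes) {
  std::string out;
  for (unsigned char b : bytes) {
    if (b < 0x80) {
      out += static_cast<char>(b); // ASCII is identical in all these charsets
    } else {
      // Code points U+0080..U+00FF take two bytes in UTF-8.
      out += static_cast<char>(0xC0 | (b >> 6));
      out += static_cast<char>(0x80 | (b & 0x3F));
    }
  }
  return out;
}
```

Passing the UTF-8 bytes of "é" (0xC3 0xA9) gives back 0xC3 0x83 0xC2 0xA9, which is the UTF-8 encoding of "Ã©". The data itself was never damaged; only the interpretation of the bytes changed.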
As far as a solution is concerned, converting your whole code to std::wstring sounds like the first step. You might need to add a conversion from UTF-8 to UTF-16, depending on the results of the test I suggested.
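For reference, a UTF-8 to UTF-16 conversion can be sketched in portable C++ as below. This is a hand-written illustration, not something Simple-Web-Server provides; the name utf8_to_utf16 is hypothetical, and the decoder checks only the basic byte structure (it does not reject overlong encodings or encoded surrogates). On Windows, the Win32 function MultiByteToWideChar with CP_UTF8 is the usual way to do the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
// Throws std::invalid_argument on structurally broken input.
std::u16string utf8_to_utf16(const std::string &in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    std::uint32_t cp = 0;  // decoded code point
    std::size_t extra = 0; // number of continuation bytes expected
    if (lead < 0x80) { cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (i + extra >= in.size())
      throw std::invalid_argument("truncated UTF-8 sequence");

    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;

    if (cp < 0x10000) {
      out += static_cast<char16_t>(cp);
    } else {
      // Code points above the BMP become a surrogate pair.
      cp -= 0x10000;
      out += static_cast<char16_t>(0xD800 | (cp >> 10));
      out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
  }
  return out;
}
```

Testing with a Chinese character, as suggested above, is a good check: a three-byte UTF-8 sequence should come out as a single 16-bit unit, which a byte-for-byte reinterpretation would not produce.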
PS: well, that would be one possible meaning. Most of the time it's closer to "to entice someone", with its sexual connotation. But then there is "à lécher quelqu'un", which sounds exactly the same said out loud but has an explicit sexual meaning (lick someone). It's a French wordplay ;)
EDIT: about your 3rd question, in what charset is the source file encoded? Are you sure your compiler isn't doing the conversion for you? My guess is that the source file is actually encoded in CP1252, so the text is stored as a single-byte character string in your binary and is rendered by the CP1252 string renderer. So the result is correct, but that "é" is never stored as a 2-byte character at any point.
The thing is, at the end of the day a string is just an array of bytes ended by a 0. Characters are some sort of drawing, a visual representation. What makes a string a list of characters displayed on screen is a decoding process using a map. These maps are what we call charsets, and there is a huge bunch of them :)
PPS: thanks for making me dig into this, I learned a rope or 2 in the process
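The "array of bytes" point can be made concrete by spelling the same word out byte by byte in both encodings. The constants and the helper utf8_code_points below are made up for this illustration; the byte values are the standard CP1252 and UTF-8 encodings of "é".

```cpp
#include <cstddef>
#include <string>

// "allécher" written byte by byte, so the result does not depend on the
// encoding of this source file. The literals are split so that the hex
// escape does not swallow the following letter 'c'.
const char kCp1252Bytes[] = "all\xE9" "cher";  // é is the single byte 0xE9
const char kUtf8Bytes[] = "all\xC3\xA9" "cher"; // é is the two bytes 0xC3 0xA9

// Count characters in UTF-8 text by skipping continuation bytes (10xxxxxx).
std::size_t utf8_code_points(const std::string &s) {
  std::size_t count = 0;
  for (unsigned char b : s) {
    if ((b & 0xC0) != 0x80)
      ++count;
  }
  return count;
}
```

Both arrays show the same eight characters when decoded with the matching map, yet the first holds 8 bytes and the second 9. A std::string stores either one without complaint; which one a literal in your program contains depends on the source file encoding and the compiler settings.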
I read the text like this: request->content.string(); This returns std::string, and if I send "allécher quelqu'un" it reads "allÃ©cher quelqu'un". When I convert the result I get from request->content.string(); to std::wstring, it finally reads the string correctly!
I can tell you that the issue is not with the sender or receiver. I am 99.999% sure that the binary data you have in the std::string variable is correct utf8.
The problem is the printing routine (are you printing to the console perhaps?) that does not handle / understand that you are printing utf8. Try writing the string to a file and open it in a utf8 aware editor and if you see the correct text then you are all good. Well, given that you find a utf8 aware lib to display the content.
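Another way to check the bytes without trusting any console or debugger rendering is to print them as hexadecimal, which is plain ASCII and displays the same everywhere. The helper hex_dump below is a hypothetical sketch, not part of Simple-Web-Server.

```cpp
#include <string>

// Hypothetical debugging helper: render each byte of `bytes` as two lowercase
// hex digits separated by spaces, so the output is pure ASCII.
std::string hex_dump(const std::string &bytes) {
  static const char digits[] = "0123456789abcdef";
  std::string out;
  for (unsigned char b : bytes) {
    if (!out.empty())
      out += ' ';
    out += digits[b >> 4];
    out += digits[b & 0x0F];
  }
  return out;
}
```

If the dump of request->content.string() shows "c3 a9" where the "é" should be, the body arrived as correct UTF-8 and only the display is at fault.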
The console on windows is pretty bad. I write lots of apps that deal with utf8 text and I never care about how it looks in the console because I only use it for debugging and know it will look like crap there.