pugixml

provide some encoding conversion functions

Open Lysander opened this issue 8 years ago • 10 comments

One thing that bothers me when working with pugi is that the library is built around an internal string encoding, which is either UTF-8 (for char) or UTF-16/32 (for wchar_t). As the rest of the C++ world still hasn't agreed on a standard string encoding (there isn't even a data type for it yet!), one often has to work with applications that rely on other encodings internally. So in order to work with pugi, one always has to convert between the encoding used by one's program and the UTF-* encoding pugi uses for string data.

It would be great if those conversion functions were part of the public API, to make this easier! (They must exist, but I haven't found or understood the place in the source where those conversions happen.)

I am thinking of something like this:

// decode a source string into UTF-* (depending on PUGIXML_WCHAR_MODE)
std::string decode(const std::string &value, xml_encoding encoding);

// encode a pugi internal string from UTF-* into the chosen destination encoding
std::string encode(const std::string &value, xml_encoding encoding);

That would be a big improvement, IMHO!

Lysander avatar Aug 15 '17 09:08 Lysander

pugixml does provide basic encoding/decoding functions: as_utf8 and as_wide. Not sure if they fit your use case - if not, can you describe it more? (that is, what encodings do you need to convert from/to)

zeux avatar Aug 15 '17 09:08 zeux

Basically, your library uses a well-defined encoding (such as UTF-8 together with the char type) as the internal encoding for text data.

A program using your lib might use a different encoding internally; most C++ programs probably use the default system encoding. On a Windows machine in Western Europe that is typically windows-1252 (aka cp1252).

So in order to take data out of the XML DOM tree provided by your lib and process it further, one must convert the string data from UTF-8 to windows-1252, to stay close to the example above. The same is true in reverse for manipulating or creating data.

As you already convert between some encodings during loading and saving, you could probably expose those conversions to better support the use case I described! (Some more encodings for saving and loading would also be great; windows-1252 in particular is important for almost all Windows devs in Western Europe! And sadly it is not the same as iso-8859-1 aka latin1.)

I hope it is clear now?

If not, I will have to think harder about how to explain it better :-D

Lysander avatar Aug 15 '17 17:08 Lysander
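
For readers following along, the difference between windows-1252 and ISO-8859-1 mentioned above can be made concrete with a small portable sketch. The two encodings only disagree on bytes 0x80..0x9F, so a decoder needs one 32-entry table plus ordinary UTF-8 encoding. `cp1252_to_utf8` and `kCp1252High` are hypothetical names for illustration and are not part of pugixml; passing the five unassigned bytes through as C1 controls is an assumption of this sketch, not something the thread specifies.

```cpp
#include <string>

// Code points for windows-1252 bytes 0x80..0x9F, the only range where it
// differs from ISO-8859-1. The five bytes the code page leaves unassigned
// (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as C1 controls here,
// which is an assumption of this sketch.
static const unsigned short kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper (not a pugixml function): windows-1252 bytes -> UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        // Map the byte to a Unicode code point; outside 0x80..0x9F the
        // byte value is the code point, exactly as in ISO-8859-1.
        unsigned cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        if (cp < 0x80) {
            out += static_cast<char>(cp);                          // 1 byte
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With pugixml built in its default char mode, the result of such a helper could be handed to the library as UTF-8 text; going the other way needs the inverse mapping plus a policy for characters that windows-1252 cannot represent.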

Let me make sure I understand this correctly...

  1. You have an XML file with XML contents encoded using utf-8
  2. You are loading this XML file with pugixml compiled in utf8 mode (PUGIXML_WCHAR_MODE is not defined)
  3. Your application works with strings encoded using windows-1252
  4. ... thus you need encoding conversion functions between utf-8 and windows-1252

Is this accurate?

Re: windows-1252 support, this is definitely possible to implement. In general I prefer to only support Unicode encodings but if there are particular popular 8-bit encodings that aren't too hard to support it may be worthwhile.

zeux avatar Aug 18 '17 21:08 zeux

> Is this accurate?

Yes!

> Re: windows-1252 support, this is definitely possible to implement. In general I prefer to only support Unicode encodings but if there are particular popular 8-bit encodings that aren't too hard to support it may be worthwhile.

Sounds good!

The problem is that there are legacy applications en masse in the wild. If you need a library, you must be sure you can interoperate with it. So, since you decided to use UTF-* internally, every application that works with a different encoding must convert its data.

Btw. I am quite sure that there are lots of applications out there that haven't even figured out that they must convert the data in order to use your lib correctly! Perhaps you are a Linux-only dev? Because on Windows there is no Unicode system encoding by default! They keep on using their old perverted codepages... :-(

Lysander avatar Aug 19 '17 09:08 Lysander

On the other hand it's not too hard to wrap MultiByteToWideChar() and WideCharToMultiByte() in a usable interface and use the result for conversion. At least that's the path I chose to take.

brandl-muc avatar Aug 19 '17 10:08 brandl-muc
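
A wrapper around MultiByteToWideChar() and WideCharToMultiByte() of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not brandl-muc's actual code: the names `to_wide`, `from_wide`, `utf8_to_ansi` and `ansi_to_utf8` are hypothetical, the code is Windows-only (hence the `_WIN32` guard), it converts against `CP_ACP` (the system's current ANSI code page), and error handling is reduced to returning an empty string.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: bytes in `codepage` -> UTF-16.
std::wstring to_wide(const std::string& in, UINT codepage) {
    if (in.empty()) return std::wstring();
    // First call asks for the required length in wchar_t units.
    int n = MultiByteToWideChar(codepage, 0, in.data(),
                                static_cast<int>(in.size()), nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<size_t>(n), L'\0');
    MultiByteToWideChar(codepage, 0, in.data(),
                        static_cast<int>(in.size()), &out[0], n);
    return out;
}

// Hypothetical helper: UTF-16 -> bytes in `codepage`.
std::string from_wide(const std::wstring& in, UINT codepage) {
    if (in.empty()) return std::string();
    // First call asks for the required length in bytes.
    int n = WideCharToMultiByte(codepage, 0, in.data(),
                                static_cast<int>(in.size()),
                                nullptr, 0, nullptr, nullptr);
    if (n <= 0) return std::string();
    std::string out(static_cast<size_t>(n), '\0');
    WideCharToMultiByte(codepage, 0, in.data(),
                        static_cast<int>(in.size()),
                        &out[0], n, nullptr, nullptr);
    return out;
}

// UTF-8 (pugixml's char-mode encoding) <-> current ANSI code page,
// going through UTF-16 in both directions.
std::string utf8_to_ansi(const std::string& s) {
    return from_wide(to_wide(s, CP_UTF8), CP_ACP);
}

std::string ansi_to_utf8(const std::string& s) {
    return from_wide(to_wide(s, CP_ACP), CP_UTF8);
}
#endif
```

Passing 1252 instead of `CP_ACP` would force windows-1252 regardless of the system locale. Characters that the target code page cannot represent are replaced by the system default character in `utf8_to_ansi`, so that direction is lossy.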

Right, so I believe that in this case we're explicitly talking about a legacy Windows application - a modern Windows application would probably use UTF-16 and *W style APIs, or UTF-8.

In this case I'm sort of reluctant to add this - it's not clear why pugixml should support legacy applications by providing easy to use functions to convert arbitrary strings back and forth between UTF-8 and specifically Windows 1252. For a legacy Win32 application I'd expect it to use MultiByteToWideChar with CP_ACP encoding (which is not Windows 1252 in general - it just happens to be Windows 1252 on many Windows installations).

zeux avatar Aug 19 '17 18:08 zeux

> Btw. I am quite sure that there are lots of applications out there that haven't even figured out that they must convert the data in order to use your lib correctly!

I believe that if you're referring to the need to convert data to Windows 1252 then this is a misconception - it's incorrect to do that, it's correct to use Unicode Windows APIs. pugixml helps with that by providing two options:

  1. PUGIXML_WCHAR_MODE - I believe a lot of Windows software uses pugixml in this mode because then you can just use wchar_t strings everywhere and pass them directly to Windows API functions
  2. pugi::as_wide - some applications prefer to use UTF-8 internally, and call pugi::as_wide before passing strings to Windows API

In both cases the correct Windows API functions to use are *W (wide char variants), not ANSI variants.

zeux avatar Aug 19 '17 18:08 zeux

> I believe that if you're referring to the need to convert data to Windows 1252 then this is a misconception - it's incorrect to do that, it's correct to use Unicode Windows APIs

I would call this just legacy ;-)

To be serious: why is it only correct to use the Unicode Windows API (which is also something of a mess; read utf8everywhere)? Our code compiles without warnings, so if MS no longer allowed this mode, why would it work?

Look, the application I work on is about 17 years old. It would be tremendously challenging to change the internal text data to wchar_t and use UTF-16, because of the code size and the many systems we communicate with, where we would also have to check carefully whether recoding is needed and which. Of course, lots of stuff isn't covered by automated tests yet. And of course there are many, many dirty hacks that make narrow assumptions about strings.

I really don't like C++ for many reasons; one is the lack of a real Unicode abstraction. But as we need to deal with reality, we need to find ways to get along. I simply had the idea that exposing functionality you already have would be a nice feature.

Lysander avatar Aug 20 '17 12:08 Lysander

> Re: windows-1252 support, this is definitely possible to implement. In general I prefer to only support Unicode encodings but if there are particular popular 8-bit encodings that aren't too hard to support it may be worthwhile.

Just to remind you of this one ;-) Even if the original proposal is nothing you want to add, this one would definitely be nice!

Lysander avatar Aug 21 '17 11:08 Lysander

Well, there seems to be a disconnect in that pugixml currently doesn't have the functionality you are looking for - none of the encoding conversions implemented internally are adequate for exchanging data between pugixml and Windows ANSI APIs.

If pugixml gets windows-1252 support and if your application only supports windows-1252 and if we're still talking about UTF8-encoded documents then sure, this can be beneficial. But since you could get similar functionality via MultiByteToWideChar it's not clear to me why specifically that's beneficial to include in pugixml :)

Obviously I'm not suggesting that you rework your application in any way, but applications that use ANSI Windows APIs and assume that ANSI=Windows-1252 are inherently legacy from my point of view in that they don't support a lot of languages. Even assuming that ANSI=current Windows locale is better (for example, my native language is Russian; if your application works in ANSI and assumes it's Windows 1252 I can't use it; if your application works in ANSI but uses the current locale for all encoding/decoding operations then I can still use it even though it doesn't use Unicode); the ideal situation of course is to use Unicode, which - in Windows world - means UTF-16. So the API that you're proposing seems to help in cases where there are other ways to do this (via Windows APIs) and seems to support the worst possible use case from internationalization standpoint.

This is very different from supporting Windows-1252 internally for encoding conversion purposes (because that just says "I have XML documents that happen to use Windows-1252 that happens to be more popular than Latin-1 so why not support it" - makes sense).

zeux avatar Aug 30 '17 04:08 zeux