tinyxml2
obtain unicode character
I'm working on a project that needs to parse Unicode characters. I have an XML file that looks roughly like this:
<map>
<unicode v="λ"/>
<unicode v="a"/>
</map>
The value of attribute "v" in the "unicode" element is a single Unicode character (λ and a in the example above). The problem is that there is no function to read it as a character, even when it is plain ASCII like 'a'. I tried code like this:
const XMLElement* root = _doc.RootElement();
const XMLElement* e = root->FirstChildElement("unicode");
int v = 0; // or wchar_t
// add a function like QueryCharAttribute that can read unicode character ?
int err = e->QueryIntAttribute("v", &v); // I get the error XML_WRONG_ATTRIBUTE_TYPE
and I tried this:
const char* s = e->Attribute("v");
wchar_t c = (wchar_t)s[0]; // I got a wrong value for 'λ'
I know there's a stupid way to solve this problem: just replace each Unicode character with its code point, so the XML would look like:
<map>
<unicode v="955"/>
<unicode v="97"/>
</map>
But is there a more efficient way to achieve my goal? Thanks!
Get the attribute as a string and convert it from UTF-8.
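A minimal sketch of that suggestion, assuming the attribute value is valid UTF-8 and only its first character is wanted. `DecodeFirstUtf8` is a hypothetical helper written for this illustration; it is not part of TinyXML-2.

```cpp
#include <cstdint>

// Hypothetical helper (not a TinyXML-2 function): decodes the first UTF-8
// sequence in `s` and returns its Unicode code point.
// Returns -1 for a null or empty string or for malformed UTF-8.
inline std::int32_t DecodeFirstUtf8(const char* s) {
    if (!s || !*s) return -1;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    const unsigned char lead = p[0];

    int extra = 0;          // number of continuation bytes expected
    std::int32_t cp = 0;    // code point being assembled
    if (lead < 0x80) {
        return lead;        // plain ASCII, e.g. 'a' -> 97
    } else if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07;
    } else {
        return -1;          // stray continuation byte or invalid lead byte
    }

    for (int i = 1; i <= extra; ++i) {
        // A NUL terminator fails this check too, so we never read past the end.
        if ((p[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
    static const std::int32_t kMin[] = {0, 0x80, 0x800, 0x10000};
    if (cp < kMin[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return -1;
    }
    return cp;
}
```

With the XML from the question, `DecodeFirstUtf8(e->Attribute("v"))` should give 955 for λ and 97 for a. Casting `s[0]` to `wchar_t` goes wrong because λ takes two bytes in UTF-8 (0xCE 0xBB), and the cast only looks at the first one.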
@kleuter Character encoding is a tough thing to handle in C++ (actually I'm struggling with it...). This looks like the best way to solve my problem so far. Thank you a lot for your patience :+1:
I don't see how this can be fixed other than by storing the decoded integer value of the code point inside the attribute "just in case". The current code sees the entity encoding and crafts UTF-8 from it, and dealing with that UTF-8 properly is totally non-trivial.
TinyXML-2 is unicode (UTF-8) pure. It doesn't support UTF-16 or UCS-2, and won't. Character encoding is just too big, and should be done outside of TinyXML-2.
@leethomason Actually there is code inside that can handle this. And the problem is not about converting strings back and forth; it's about getting a single character (perhaps as an int).
Fair point; I get caught up on the UTF-16 thing. Returning as int (UTF-32) is actually a pretty reasonable API. I'm not sure what the overlap is between UTF-32 and UTF-16 - need to do some research there - but if they mostly overlap it's probably pretty useful.
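On the overlap question: every code point up to U+FFFF, except the surrogate range U+D800 to U+DFFF, has the same numeric value as a single UTF-16 code unit and as a UTF-32 value. Only code points above U+FFFF need a surrogate pair in UTF-16. A minimal sketch of the conversion, with `EncodeUtf16` as a hypothetical helper that is not part of TinyXML-2:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a TinyXML-2 function): writes the UTF-16 encoding
// of code point `cp` into `out` and returns the number of code units written
// (1 or 2), or 0 if `cp` is not a valid Unicode scalar value.
inline std::size_t EncodeUtf16(std::uint32_t cp, char16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;  // out of range, or a lone surrogate
    }
    if (cp < 0x10000) {
        // Basic Multilingual Plane: UTF-16 and UTF-32 values are identical.
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

So an int-returning API would cover λ (U+03BB) with the same value a UTF-16 caller expects, while something like U+1F600 would come back as one int that a UTF-16 caller has to split into two units.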