woodstox
Cannot encode Supplementary Ideographic Plane Unicode characters
When I try to encode a Supplementary Ideographic Plane Unicode character, e.g. the Chinese character "𤰉" (0x24c09), a com.ctc.wstx.exc.WstxLazyException occurs: "Illegal character entity: expansion character (code 0xd853) not a valid XML character". It seems that validateChar() in com.ctc.wstx.sr.StreamScanner cannot handle Unicode characters in the Supplementary Ideographic Plane, which are represented by surrogate pairs in the UTF-16 encoding form.
I would need a reproduction in code form: the existing code should work as expected with respect to surrogate pairs. But would this character actually be a valid XML character as per the XML 1.0 (or 1.1) specification? Keep in mind that not all valid Unicode code points are valid XML characters.
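For reference, the validity question can be checked directly against the Char production in the XML 1.0 specification (section 2.2). The following is a minimal sketch, not Woodstox code; the class and method names are made up for illustration. It suggests that the code point 0x24C09 is a legal XML character, while the lone surrogate code unit 0xD853 named in the error message is not.

```java
public class XmlCharCheck {
    // Hypothetical helper: mirrors the Char production of XML 1.0, section 2.2:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // The full supplementary code point is allowed.
        System.out.println(isXml10Char(0x24C09)); // prints "true"
        // A surrogate code unit on its own falls in the excluded D800-DFFF gap.
        System.out.println(isXml10Char(0xD853));  // prints "false"
    }
}
```

So a numeric reference to the whole code point (0x24C09) should be acceptable to a conforming parser, whereas a reference to one half of the pair (0xD853) would be rejected. Whether that is what triggered the exception here cannot be told without the reproduction.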
XML uses UTF-8 encoding, so a UTF-16 code unit in the 0xD8** range is not allowed. But I think there must be some way to encode Supplementary Ideographic Plane Unicode characters. It seems that getIntEntity() in com.ctc.wstx.sr.StreamScanner deals with surrogate pairs correctly. I will paste reproduction code later.
Thanks a lot.
    protected EntityDecl getIntEntity(int ch, final char[] originalChars)
    {
        String cacheKey = new String(originalChars);
        IntEntity entity = mCachedEntities.get(cacheKey);
        if (entity == null) {
            String repl;
            if (ch <= 0xFFFF) {
                repl = Character.toString((char) ch);
            } else {
                StringBuffer sb = new StringBuffer(2);
                ch -= 0x10000;
                sb.append((char) ((ch >> 10) + 0xD800));
                sb.append((char) ((ch & 0x3FF) + 0xDC00));
                repl = sb.toString();
            }
            entity = IntEntity.create(new String(originalChars), repl);
            mCachedEntities.put(cacheKey, entity);
        }
        return entity;
    }
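The surrogate arithmetic in the else branch above can be tried in isolation. This is a small standalone sketch, assuming nothing about Woodstox internals; the class and method names are invented for illustration. It applies the same shift-and-mask steps to 0x24C09 and compares the result with the JDK's own Character.toChars.

```java
public class SurrogateSplit {
    // Hypothetical helper: same arithmetic as the else branch quoted above.
    static String toUtf16(int codePoint) {
        if (codePoint <= 0xFFFF) {
            return String.valueOf((char) codePoint);
        }
        int offset = codePoint - 0x10000;
        char high = (char) ((offset >> 10) + 0xD800);   // top 10 bits
        char low = (char) ((offset & 0x3FF) + 0xDC00);  // bottom 10 bits
        return new String(new char[] { high, low });
    }

    public static void main(String[] args) {
        String s = toUtf16(0x24C09);
        // prints "D853 DC09"
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
        // Agrees with the JDK's conversion; prints "true"
        System.out.println(s.equals(new String(Character.toChars(0x24C09))));
    }
}
```

The high surrogate comes out as 0xD853, which is the code reported in the exception message.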
But Woodstox should handle encoding from Java's internal UTF-16 representation, with surrogate pairs, into UTF-8 output (or whatever encoding is used) without problems. Unless these characters are outside the range of what is expressible by two surrogate characters? While it may make sense to use internal character entities, that should not be strictly necessary.
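To illustrate what that encoding step is expected to produce, here is a plain-JDK sketch with no Woodstox involved; the class and method names are hypothetical. A Java string holding the surrogate pair D853 DC09 encodes to a single four-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateToUtf8 {
    // Hypothetical helper: hex dump of a string's UTF-8 bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ch = "\uD853\uDC09"; // U+24C09 as a surrogate pair
        // One code point, two UTF-16 code units; prints "1 2"
        System.out.println(ch.codePointCount(0, ch.length()) + " " + ch.length());
        // Four UTF-8 bytes; prints "F0A4B089"
        System.out.println(utf8Hex(ch));
    }
}
```

U+24C09 is well within the range that two surrogates can express (up to U+10FFFF), so the range concern raised above should not apply to this character.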
So: I am not aware of any existing issues with surrogate pair handling; they should "just work" as expected. But it is of course possible that there is a bug somewhere.
No way to reproduce, closing. May be re-opened/re-filed with reproduction.