woodstox

Cannot encode Supplementary Ideographic Plane Unicode characters

Open wssx-cloud opened this issue 2 years ago • 3 comments

When trying to encode a Supplementary Ideographic Plane Unicode character, e.g. the Chinese character "𤰉" (0x24c09), a com.ctc.wstx.exc.WstxLazyException occurs: "Illegal character entity: expansion character (code 0xd853) not a valid XML character". It seems that validateChar() in com.ctc.wstx.sr.StreamScanner can't deal with Unicode characters in the Supplementary Ideographic Plane, which are expressed as surrogate pairs in the UTF-16 encoding form.

wssx-cloud, Jun 10 '22 00:06
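For readers following along, the 0xd853 in the error message is the high surrogate of U+24C09. The sketch below uses only `java.lang.Character` to show the split; the class and method names are made up for illustration and are not part of Woodstox.

```java
// Illustrative sketch: split a supplementary code point into UTF-16 code units.
// SurrogatePairDemo and utf16Units are hypothetical names, not Woodstox API.
final class SurrogatePairDemo {

    /** Returns the UTF-16 code units of a code point as space-separated hex. */
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x24C09; // the character from the report
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(utf16Units(cp)); // 0xD853 0xDC09
    }
}
```

So a check that looks at each `char` separately would see 0xD853 on its own, while a check that first combines the pair would see 0x24C09.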

I would need a reproduction in code form: existing code should work as expected wrt surrogate pairs. But would this character actually be a valid XML character as per the XML 1.0 (or 1.1) specification? Keep in mind that not all valid Unicode code points are valid XML characters.

cowtowncoder, Jun 10 '22 00:06
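On the validity question: the XML 1.0 `Char` production allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of that rule applied to code points follows; `isXml10Char` is a hypothetical helper written for this thread, not the Woodstox `validateChar()` implementation.

```java
// Illustrative sketch of the XML 1.0 "Char" production, applied per code point.
// XmlCharCheck and isXml10Char are hypothetical names, not Woodstox API.
final class XmlCharCheck {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x24C09)); // true: inside #x10000-#x10FFFF
        System.out.println(isXml10Char(0xD853));  // false: a lone surrogate is excluded
    }
}
```

Under this rule U+24C09 itself is a valid XML character, while the individual surrogate 0xD853 is not.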

> I would need a reproduction in code form: existing code should work as expected wrt surrogate pairs. But would this character actually be a valid XML character as per the XML 1.0 (or 1.1) specification? Keep in mind that not all valid Unicode code points are valid XML characters.

XML uses UTF-8 encoding, so a 0xD8** UTF-16 code unit is not allowed. But I think there must be some way to encode Supplementary Ideographic Plane Unicode characters. It seems that getIntEntity() in com.ctc.wstx.sr.StreamScanner deals with surrogate pairs correctly. I will paste reproduction code later.

Thanks a lot.

    protected EntityDecl getIntEntity(int ch, final char[] originalChars)
    {
        String cacheKey = new String(originalChars);

        IntEntity entity = mCachedEntities.get(cacheKey);
        if (entity == null) {
            String repl;
            if (ch <= 0xFFFF) {
                repl = Character.toString((char) ch);
            } else {
                StringBuffer sb = new StringBuffer(2);
                ch -= 0x10000;
                sb.append((char) ((ch >> 10)  + 0xD800));
                sb.append((char) ((ch & 0x3FF)  + 0xDC00));
                repl = sb.toString();
            }
            entity = IntEntity.create(new String(originalChars), repl);
            mCachedEntities.put(cacheKey, entity);
        }
        return entity;
    }

wssx-cloud, Jun 10 '22 06:06

But Woodstox should handle encoding from Java modified UCS-2, with surrogate pairs, into UTF-8 output (or whatever encoding is used) without problems. Unless these characters are outside the range of what is expressible by 2 surrogate characters? While it may make sense to use internal character entities, that should not be strictly necessary.

So: I am not aware of existing issues with surrogate pair handling; they should "just work" as expected. But it is of course possible there might be a bug somewhere.

cowtowncoder, Jun 14 '22 00:06
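As a point of comparison, the JDK's own UTF-8 encoder turns the surrogate pair for U+24C09 into a single four-byte sequence. The sketch below uses only the standard library and says nothing about Woodstox's writer internals; `utf8Hex` is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: UTF-8 bytes produced by the JDK for a surrogate pair.
// Utf8EncodeDemo and utf8Hex are hypothetical names, not Woodstox API.
final class Utf8EncodeDemo {

    /** Returns the UTF-8 encoding of a string as space-separated hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD853\uDC09" is the surrogate pair for U+24C09.
        System.out.println(utf8Hex("\uD853\uDC09")); // F0 A4 B0 89
    }
}
```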

No way to reproduce, closing. May be re-opened/re-filed with reproduction.

cowtowncoder, Oct 19 '23 01:10
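For anyone who wants to re-file, a reproduction could start from something like the sketch below. It is an assumption about what the test might look like, since the reporter's actual input was never posted. It uses only the standard StAX API (`javax.xml.stream`); with Woodstox on the classpath, `XMLInputFactory.newInstance()` would typically resolve to the Woodstox implementation, otherwise the JDK's built-in parser is used. The class name, method name, and `<root>` document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reproduction sketch: parse a numeric character reference
// to a supplementary code point and inspect the resulting text.
final class SupplementaryCharRepro {

    /** Parses a one-element document and returns the element's text content. */
    static String parseText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag();               // advance to the root START_ELEMENT
                return reader.getElementText(); // collect all character content
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Surface parser errors (such as the one in the report) unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = parseText("<root>&#x24C09;</root>");
        System.out.println(text.codePointCount(0, text.length()));
        System.out.printf("U+%X%n", text.codePointAt(0));
    }
}
```

If this parses cleanly, a variant that matches the failing input more closely (for example, the exact document or entity declaration that triggered the exception) would be the thing to attach.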