woodstox: Cannot encode Supplementary Ideographic Plane Unicode characters

Open wssx-cloud opened this issue 2 years ago • 3 comments

When trying to encode a Unicode character from the Supplementary Ideographic Plane, e.g. the Chinese character "𤰉" (0x24c09), a com.ctc.wstx.exc.WstxLazyException occurs: "Illegal character entity: expansion character (code 0xd853) not a valid XML character". It seems that validateChar() in com.ctc.wstx.sr.StreamScanner cannot deal with characters in the Supplementary Ideographic Plane, which are represented by surrogate pairs in the UTF-16 encoding form.
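For readers unfamiliar with surrogate pairs, here is a minimal sketch of how the code point from the report splits into two UTF-16 code units. It uses only standard `java.lang.Character` methods; the class name `SurrogateDemo` is made up for illustration. Note that the high surrogate matches the 0xd853 quoted in the error message.

```java
final class SurrogateDemo {
    /** Returns the UTF-16 code units for a code point (one or two chars). */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x24C09; // the character from the report
        char[] units = toUtf16(cp);
        // Prints: U+24C09 -> 0xD853 0xDC09
        System.out.printf("U+%X -> 0x%X 0x%X%n", cp, (int) units[0], (int) units[1]);
        // Recombining the pair yields the original code point again.
        System.out.println(Character.toCodePoint(units[0], units[1]) == cp); // true
    }
}
```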

Jun 10 '22 00:06 wssx-cloud

I would need a reproduction in code form: existing should work as expected wrt surrogate pairs. But would this character actually be a valid XML character as per XML 1.0 (or 1.1) specification? Keeping in mind that not all valid Unicode codepoints are valid XML characters.
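As a rough illustration of the validity question, the `Char` production from the XML 1.0 specification can be written as a predicate over full code points. This is a sketch only, not Woodstox's actual `validateChar()` logic, and the class and method names are hypothetical.

```java
final class XmlCharCheck {
    /** XML 1.0 Char production, applied to a whole code point (not a single UTF-16 unit). */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x24C09)); // true: supplementary code point
        System.out.println(isXml10Char(0xD853));  // false: lone high surrogate
    }
}
```

Under this reading, U+24C09 itself is a valid XML character, while a check applied to one half of its surrogate pair in isolation would reject it.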

Jun 10 '22 00:06 cowtowncoder


XML uses UTF-8 encoding, so a UTF-16 code unit such as 0xD8** is not allowed. But I think there must be some way to encode Supplementary Ideographic Plane Unicode characters. It seems that getIntEntity() in com.ctc.wstx.sr.StreamScanner deals with surrogate pairs correctly. I will paste reproduction code later.

Thanks a lot.

    protected EntityDecl getIntEntity(int ch, final char[] originalChars)
    {
        String cacheKey = new String(originalChars);

        IntEntity entity = mCachedEntities.get(cacheKey);
        if (entity == null) {
            String repl;
            if (ch <= 0xFFFF) {
                repl = Character.toString((char) ch);
            } else {
                StringBuffer sb = new StringBuffer(2);
                ch -= 0x10000;
                sb.append((char) ((ch >> 10)  + 0xD800));
                sb.append((char) ((ch & 0x3FF)  + 0xDC00));
                repl = sb.toString();
            }
            entity = IntEntity.create(new String(originalChars), repl);
            mCachedEntities.put(cacheKey, entity);
        }
        return entity;
    }

Jun 10 '22 06:06 wssx-cloud

But Woodstox should handle encoding from Java's internal UTF-16 representation, with surrogate pairs, into UTF-8 output (or whatever encoding is used) without problems. Unless these characters are outside the range of what is expressible by two surrogate characters? While it may make sense to use internal character entities, that should not be strictly necessary.

So: I am not aware of existing issues with surrogate pair handling; they should "just work" as expected. But it is of course possible there might be a bug somewhere.
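A reproduction along the following lines might help narrow this down. It is only a sketch of what such a test could look like, not the reporter's actual code: it reads the character both as a numeric character reference and as a literal through the standard StAX API. `XMLInputFactory.newInstance()` returns whichever StAX implementation is discovered at runtime, so Woodstox needs to be on the classpath for this to exercise Woodstox rather than the JDK's built-in parser. The class name `SupplementaryReadCheck` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class SupplementaryReadCheck {
    /** Parses the document and returns the concatenated character data. */
    static String readText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalescing merges adjacent text segments into one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String literal = new String(Character.toChars(0x24C09));
        // Form 1: a single numeric character reference to the full code point.
        String viaReference = readText("<root>&#x24C09;</root>");
        // Form 2: the character embedded literally (a surrogate pair in the Java String).
        String viaLiteral = readText("<root>" + literal + "</root>");
        System.out.println(viaReference.equals(literal));
        System.out.println(viaLiteral.equals(literal));
    }
}
```

If both forms parse cleanly, the failure presumably depends on how the original input expressed the character, which is what a reproduction would need to show.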

Jun 14 '22 00:06 cowtowncoder

No way to reproduce, closing. May be re-opened/re-filed with reproduction.

Oct 19 '23 01:10 cowtowncoder
