[Java] implement fast utf16 to utf8 conversion
Is your feature request related to a problem? Please describe.
Currently Fury uses `java.lang.StringCoding#encode(java.nio.charset.Charset, char[], int, int)` to convert UTF-16 to UTF-8:
```java
static byte[] encode(Charset cs, char[] ca, int off, int len) {
    CharsetEncoder ce = cs.newEncoder();
    int en = scale(len, ce.maxBytesPerChar());
    byte[] ba = new byte[en];
    if (len == 0)
        return ba;
    boolean isTrusted = false;
    if (System.getSecurityManager() != null) {
        if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
            ca = Arrays.copyOfRange(ca, off, off + len);
            off = 0;
        }
    }
    ce.onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .reset();
    if (ce instanceof ArrayEncoder) {
        int blen = ((ArrayEncoder) ce).encode(ca, off, len, ba);
        return safeTrim(ba, blen, cs, isTrusted);
    } else {
        ByteBuffer bb = ByteBuffer.wrap(ba);
        CharBuffer cb = CharBuffer.wrap(ca, off, len);
        try {
            CoderResult cr = ce.encode(cb, bb, true);
            if (!cr.isUnderflow())
                cr.throwException();
            cr = ce.flush(bb);
            if (!cr.isUnderflow())
                cr.throwException();
        } catch (CharacterCodingException x) {
            throw new Error(x);
        }
        return safeTrim(ba, bb.position(), cs, isTrusted);
    }
}
```
This invokes `sun.nio.cs.UTF_8.Encoder#encode`:
```java
public int encode(char[] sa, int sp, int len, byte[] da) {
    int sl = sp + len;
    int dp = 0;
    int dlASCII = dp + Math.min(len, da.length);
    // ASCII only optimized loop
    while (dp < dlASCII && sa[sp] < '\u0080')
        da[dp++] = (byte) sa[sp++];
    while (sp < sl) {
        char c = sa[sp++];
        if (c < 0x80) {
            // Have at most seven bits
            da[dp++] = (byte) c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            da[dp++] = (byte) (0xc0 | (c >> 6));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        } else if (Character.isSurrogate(c)) {
            if (sgp == null)
                sgp = new Surrogate.Parser();
            int uc = sgp.parse(c, sa, sp - 1, sl);
            if (uc < 0) {
                if (malformedInputAction() != CodingErrorAction.REPLACE)
                    return -1;
                da[dp++] = repl;
            } else {
                da[dp++] = (byte) (0xf0 | ((uc >> 18)));
                da[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                da[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                da[dp++] = (byte) (0x80 | (uc & 0x3f));
                sp++; // 2 chars
            }
        } else {
            // 3 bytes, 16 bits
            da[dp++] = (byte) (0xe0 | ((c >> 12)));
            da[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        }
    }
    return dp;
}
```
This implementation is not efficient enough; we need a faster one.
Describe the solution you'd like
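One possible direction, sketched below purely as an illustration rather than a settled design: encode straight into a worst-case-sized byte array and skip the `CharsetEncoder` setup, the security-manager check, and the defensive array copy, while keeping the ASCII fast path. The class and method names (`FastUtf8`, `encodeUtf8`) are hypothetical.

```java
import java.util.Arrays;

// Sketch only: a direct UTF-16 -> UTF-8 encoder that avoids the CharsetEncoder
// machinery used by StringCoding#encode. Unpaired surrogates are replaced with '?'.
public final class FastUtf8 {
  public static byte[] encodeUtf8(char[] src, int off, int len) {
    // Worst case: 3 bytes per char (a surrogate pair is 2 chars -> 4 bytes <= 6).
    byte[] dst = new byte[len * 3];
    int sp = off, sl = off + len, dp = 0;
    // ASCII fast path: most strings are pure ASCII in practice.
    while (sp < sl && src[sp] < 0x80) {
      dst[dp++] = (byte) src[sp++];
    }
    while (sp < sl) {
      char c = src[sp++];
      if (c < 0x80) {
        dst[dp++] = (byte) c;
      } else if (c < 0x800) {                              // 2 bytes, 11 bits
        dst[dp++] = (byte) (0xc0 | (c >> 6));
        dst[dp++] = (byte) (0x80 | (c & 0x3f));
      } else if (Character.isHighSurrogate(c) && sp < sl
          && Character.isLowSurrogate(src[sp])) {          // 4 bytes, 21 bits
        int uc = Character.toCodePoint(c, src[sp++]);
        dst[dp++] = (byte) (0xf0 | (uc >> 18));
        dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
        dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
        dst[dp++] = (byte) (0x80 | (uc & 0x3f));
      } else if (Character.isSurrogate(c)) {               // unpaired surrogate
        dst[dp++] = (byte) '?';
      } else {                                             // 3 bytes, 16 bits
        dst[dp++] = (byte) (0xe0 | (c >> 12));
        dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
        dst[dp++] = (byte) (0x80 | (c & 0x3f));
      }
    }
    return dp == dst.length ? dst : Arrays.copyOf(dst, dp);
  }
}
```

Sizing the buffer at `3 * len` is always sufficient because a surrogate pair consumes two chars but produces only four bytes; reusing a thread-local or pooled buffer could further avoid the trailing copy.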
Additional context
@chaokunyang, can I work on this?
> @chaokunyang, can I work on this?
Of course, feel free to take over it.
Can I work on this issue? If yes, I need the class name. I'm new to contributing and looking forward to working on Fury.
Hi @FormerKinG, thanks for your willingness to contribute to Apache Fury. You can take `org.apache.fury.serializer.StringSerializer` as the starting point.
Recent JDKs (17+) have significantly improved String encoding and decoding performance [1][2]. I'm wondering whether anyone has compared the optimized String encoding implementation with the JDK's.
- [1] https://bugs.openjdk.org/browse/JDK-8261744
- [2] https://bugs.openjdk.org/browse/JDK-8275863
@HuangXingBo, could we benchmark with JDK 17?
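For reference, a minimal JMH sketch along these lines could drive such a comparison (assumptions: JMH is on the classpath, and `FastUtf8.encodeUtf8` refers to the hypothetical encoder sketched above, not an existing Fury API). Running it on JDK 8, 11, and 17 would show how much of the gap the newer JDK intrinsics close.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch only: compares the JDK's String.getBytes(UTF_8) path against the
// hypothetical FastUtf8 encoder on ASCII, Latin-1, and CJK inputs.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class Utf8EncodeBenchmark {
  @Param({"ascii", "latin1", "chinese"})
  String kind;

  String str;
  char[] chars;

  @Setup
  public void setup() {
    switch (kind) {
      case "ascii":  str = "hello world, hello fury! ".repeat(8); break;
      case "latin1": str = "héllo wörld, héllo füry! ".repeat(8); break;
      default:       str = "你好，世界！".repeat(16); break;
    }
    chars = str.toCharArray();
  }

  @Benchmark
  public byte[] jdkGetBytes() {
    // JDK path; on 17+ this benefits from the improvements in [1][2].
    return str.getBytes(StandardCharsets.UTF_8);
  }

  @Benchmark
  public byte[] customEncoder() {
    // Hypothetical encoder from the sketch above.
    return FastUtf8.encodeUtf8(chars, 0, chars.length);
  }
}
```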