sqlite-jdbc
the merge from https://github.com/xerial/sqlite-jdbc/pull/564 led to a performance loss
As reported in the PR #564 thread, converting a native ByteBuffer to a String is a slow operation, because the decoder fetches the bytes from native memory one at a time. In my test on a large table (1M rows), performance dropped about 15% compared to the previous version.
The relevant OpenJDK source for the ByteBuffer-to-String path is sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer), excerpted below.
According to http://www.java-performance.com/ ("Performance of various methods of binary serialization in Java", a comparison of java.nio.ByteBuffer, sun.misc.Unsafe, java.io.DataInputStream, java.io.DataOutputStream, java.io.ByteArrayInputStream, and java.io.ByteArrayOutputStream):
RULE 1 (the same applies to reading): It is extremely slow to write single bytes to direct byte buffers. You should avoid using direct byte buffers for writing records with mostly single-byte fields.
Note: direct byte buffers are also capped by the JVM (-XX:MaxDirectMemorySize, default 64MB), while a byte[] can hold up to 2GB.
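To illustrate the rule (a sketch of my own, not from the article), compare per-byte reads from a direct buffer with one bulk get():

import java.nio.ByteBuffer;

public class DirectBufferReads {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);

        // Slow pattern: one native-memory access per byte, which is
        // effectively what the UTF-8 decoder's buffer loop does.
        long sum = 0;
        while (direct.hasRemaining()) {
            sum += direct.get();
        }

        // Fast pattern: a single bulk copy into a heap array, then
        // plain array reads.
        direct.rewind();
        byte[] copy = new byte[direct.remaining()];
        direct.get(copy);
        for (byte b : copy) {
            sum += b;
        }
        System.out.println(sum);
    }
}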
Hey @wuhongzhi, can you share your benchmarking code so I can run through it and confirm the performance loss? Also, could you please include your JVM version?
PR #564 didn't add any new single-byte writes to a direct byte buffer, so it doesn't violate RULE 1 here. Instead, the patch allocates a new buffer around the existing data returned from SQLite via NewDirectByteBuffer (https://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-14.html#NewDirectByteBuffer). That data is then read out of the ByteBuffer and converted to a String object in Java (NativeDB.java).
We've been running with the patch from #564 in production and our worst-case read times have improved. Happy to take a look at this again if you're running into problems, though.
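Roughly, the read path after the patch looks like this sketch (names here are illustrative, not the actual code; the real logic lives in NativeDB.java, with the C side calling NewDirectByteBuffer):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: the native side wraps SQLite's UTF-8 result in a direct
// ByteBuffer via JNI NewDirectByteBuffer (no copy), and the Java side
// decodes it into a String.
final class ColumnTextSketch {
    static String columnText(ByteBuffer nativeUtf8) { // wraps sqlite3_column_text memory
        return StandardCharsets.UTF_8.decode(nativeUtf8).toString();
    }
}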
Hello @tildedave, here is a test of bytes vs. buffer that simulates your patch:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BytesVsBuffer {
    // Multi-byte UTF-8 sample text (Korean translation of the java.nio.ByteBuffer javadoc)
    private static final byte[] data;
    static {
        data = ("바이트 버퍼는 직접 또는 비 직접입니다. 직접 바이트 버퍼를 감안할 때 Java 가상 시스템은 "
                + "네이티브 I/O 작업을 직접 수행하기 위해 최선을 다할 것입니다. 즉, 버퍼의 콘텐츠를 "
                + "기본 운영 체제의 기본 I/O 작업 중 하나를 호출하기 전 (또는 이후) 중간 버퍼에 "
                + "복사하지 않도록 시도합니다. ")
                .getBytes(StandardCharsets.UTF_8);
    }
    public static void main(String[] args) {
        int round = 10_000_000;
        long stime = System.nanoTime();
        for (int i = 0; i < round; i++) {
            bytesBenchmark();
        }
        long etime = System.nanoTime();
        System.out.println(String.format("new String from bytes taken %,dns", etime - stime));

        stime = System.nanoTime();
        for (int i = 0; i < round; i++) {
            bufferBenchmark();
        }
        etime = System.nanoTime();
        System.out.println(String.format("new String from buffer taken %,dns", etime - stime));
    }

    public static String bytesBenchmark() {
        return new String(Arrays.copyOf(data, data.length), StandardCharsets.UTF_8);
    }

    public static String bufferBenchmark() {
        ByteBuffer buff = ByteBuffer.allocateDirect(data.length);
        return StandardCharsets.UTF_8.decode(buff.put(data).flip()).toString();
    }
}
Results:
ROUND1: new String from bytes taken 6,880,143,752ns / new String from buffer taken 18,389,513,317ns
ROUND2: new String from bytes taken 7,046,609,094ns / new String from buffer taken 18,531,096,240ns
ROUND3: new String from bytes taken 6,802,205,509ns / new String from buffer taken 19,774,656,708ns
My environment is Linux (Ubuntu 20.04) with OpenJDK 11.
Here is an excerpt from sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer):
private CoderResult decodeBufferLoop(ByteBuffer src,
                                     CharBuffer dst)
{
    int mark = src.position();
    int limit = src.limit();
    while (mark < limit) {
        int b1 = src.get();
        if (b1 >= 0) {
            // 1 byte, 7 bits: 0xxxxxxx
            if (dst.remaining() < 1)
                return xflow(src, mark, 1); // overflow
            dst.put((char) b1);
            mark++;
        } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
            // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
            if (limit - mark < 2 || dst.remaining() < 1)
                return xflow(src, mark, 2);
            int b2 = src.get();
            if (isNotContinuation(b2))
                return malformedForLength(src, mark, 1);
            dst.put((char) (((b1 << 6) ^ b2)
                            ^
                            (((byte) 0xC0 << 6) ^
                             ((byte) 0x80 << 0))));
            mark += 2;
        } else if ((b1 >> 4) == -2) {
            // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            int srcRemaining = limit - mark;
            if (srcRemaining < 3 || dst.remaining() < 1) {
                if (srcRemaining > 1 && isMalformed3_2(b1, src.get()))
                    return malformedForLength(src, mark, 1);
                return xflow(src, mark, 3);
            }
            int b2 = src.get();
            int b3 = src.get();
            if (isMalformed3(b1, b2, b3))
                return malformed(src, mark, 3);
            char c = (char)
                ((b1 << 12) ^
                 (b2 << 6) ^
                 (b3 ^
                  (((byte) 0xE0 << 12) ^
                   ((byte) 0x80 << 6) ^
                   ((byte) 0x80 << 0))));
            if (Character.isSurrogate(c))
                return malformedForLength(src, mark, 3);
            dst.put(c);
            mark += 3;
        } else if ((b1 >> 3) == -2) {
            // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            int srcRemaining = limit - mark;
            if (srcRemaining < 4 || dst.remaining() < 2) {
                b1 &= 0xff;
                if (b1 > 0xf4 ||
                    srcRemaining > 1 && isMalformed4_2(b1, src.get() & 0xff))
                    return malformedForLength(src, mark, 1);
                if (srcRemaining > 2 && isMalformed4_3(src.get()))
                    return malformedForLength(src, mark, 2);
                return xflow(src, mark, 4);
            }
            int b2 = src.get();
            int b3 = src.get();
            int b4 = src.get();
            int uc = ((b1 << 18) ^
                      (b2 << 12) ^
                      (b3 << 6) ^
                      (b4 ^
                       (((byte) 0xF0 << 18) ^
                        ((byte) 0x80 << 12) ^
                        ((byte) 0x80 << 6) ^
                        ((byte) 0x80 << 0))));
            if (isMalformed4(b2, b3, b4) ||
                // shortest form check
                !Character.isSupplementaryCodePoint(uc)) {
                return malformed(src, mark, 4);
            }
            dst.put(Character.highSurrogate(uc));
            dst.put(Character.lowSurrogate(uc));
            mark += 4;
        } else {
            return malformed(src, mark, 1);
        }
    }
    return xflow(src, mark, 0);
}
Every single-byte read from the buffer goes through java.nio.DirectByteBuffer.get(), a per-byte access to native memory via Unsafe:
public byte get() {
    try {
        return ((UNSAFE.getByte(ix(nextGetIndex()))));
    } finally {
        Reference.reachabilityFence(this);
    }
}
Hi @wuhongzhi, thanks for the code.
Your byte buffer benchmark is doing three things: allocating the buffer, putting data into it, and then reading data out of it. The cost of allocating the buffer and putting data into it is why your benchmark is slower, not the decoding method.
If you remove the allocations and copies from your benchmarking code to isolate the decoding method, it ends up being much faster to read out of the ByteBuffer. What I used here was:
private static final byte[] data;
private static final ByteBuffer buff;
static {
    data = ("바이트 버퍼는 직접 또는 비 직접입니다. 직접 바이트 버퍼를 감안할 때 Java 가상 시스템은 "
            + "네이티브 I/O 작업을 직접 수행하기 위해 최선을 다할 것입니다. 즉, 버퍼의 콘텐츠를 "
            + "기본 운영 체제의 기본 I/O 작업 중 하나를 호출하기 전 (또는 이후) 중간 버퍼에 "
            + "복사하지 않도록 시도합니다. ")
            .getBytes(StandardCharsets.UTF_8);
    buff = ByteBuffer.allocateDirect(data.length);
    buff.put(data).flip();
}
/* main() method unchanged */
public static String bytesBenchmark() {
    return new String(data, StandardCharsets.UTF_8);
}

public static String bufferBenchmark() {
    return StandardCharsets.UTF_8.decode(buff).toString();
}
and I got the following result (OpenJDK 11.0.9.1, Intel MacBook, Big Sur):
dave@zwingli benchmark % java -classpath . BytesVsBuffer
new String from bytes taken 7,110,828,188ns
new String from buffer taken 165,582,666ns
OK, so, on to my patch: it is not allocating a new byte buffer with empty data; it allocates a new byte buffer around the pointers returned from SQLite.
Are you seeing a performance decrease when you use the sqlite-jdbc library, or is this just a best practice we should be aware of when doing similar work?
@tildedave, to answer your question: yes, my test is against the library; the example code above is just to show where the problem is.
Regarding your modification, there is an issue when bufferBenchmark() runs more than once (the buffer's position == limit, so decoding yields an empty string). After fixing your code, I ran it again:
static {
    ....
    buff = ByteBuffer.allocateDirect(data.length);
    buff.put(data);
}

public static String bufferBenchmark() {
    buff.flip();
    return StandardCharsets.UTF_8.decode(buff).toString();
}
The result is: new String from bytes taken 8,291,965,852ns / new String from buffer taken 9,634,591,090ns
That is almost the same as my first report: about a 15% performance decline. Please check it, thanks.
Ah, makes sense that the buffer needs to be reset each test run. Thanks.
My guess from the decoder code you've shared (sun.nio.cs.UTF_8) is that it re-decodes the bytes in the buffer as if they were arbitrary UTF-8, but we shouldn't need to do an extra decoding pass in Java, since SQLite has already put the data into UTF-8 format; we really just need to get the byte data into the Java String. I'll see if I can make a patch for that. Thanks again for all the info, and sorry about the performance regression.
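Roughly, the idea would be something like this sketch (Utf8BufferToString is an illustrative name, not the actual patch; new String still decodes, but over a heap byte[], which your benchmarks show is the fast path):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Utf8BufferToString {
    // Copy the bytes out of the direct buffer with one bulk get() and let
    // the String constructor decode from the heap array, avoiding a
    // per-byte DirectByteBuffer.get() loop.
    static String toString(ByteBuffer directUtf8) {
        byte[] bytes = new byte[directUtf8.remaining()];
        directUtf8.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}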
Alright, but beware: the UTF-8 Java uses across JNI (modified UTF-8) is not exactly the same as SQLite's UTF-8; see https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8
A detailed discussion is here: https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
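A small demo of the difference (my own example, not from the links): standard UTF-8 encodes a supplementary character in 4 bytes, while JNI's modified UTF-8 (used by NewStringUTF/GetStringUTFChars) encodes each surrogate separately for 6 bytes, and it also encodes U+0000 as 0xC0 0x80 instead of a single 0x00.

import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) {
        // U+10400 needs a surrogate pair in Java's UTF-16 representation.
        String supplementary = new String(Character.toChars(0x10400));

        // Standard UTF-8 (what SQLite stores): one 4-byte sequence.
        byte[] standard = supplementary.getBytes(StandardCharsets.UTF_8);
        System.out.println("standard UTF-8 bytes: " + standard.length); // prints 4

        // Modified UTF-8 would encode the two surrogates as two 3-byte
        // sequences (6 bytes total), so SQLite's bytes cannot be passed
        // to NewStringUTF unmodified for supplementary characters.
    }
}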
What's the consensus on this? Is it still relevant?
We merged https://github.com/xerial/sqlite-jdbc/pull/575, which had a slight performance improvement over the original merge. My recollection (though I guess I didn't write it on the issue or pull request, sorry) is that there was still something like a 5% performance degradation versus the original behavior. @wuhongzhi had some more complicated C code in that PR that might be faster.