sqlite-jdbc the merge from https://github.com/xerial/sqlite-jdbc/pull/564 led to performance lost

As in in thread 564 report, nativeByteBuffer to string is a slow operation as the decoder get byte one by one via JNI. In my test in a large table(1m rows), lost 15% compare to previous version.

src for open jdk bytebuffer to string in sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer)

According to http://www.java-performance.com/ Performance of various methods of binary serialization in Java: java.nio.ByteBuffer, sun.misc.Unsafe, java.io.DataInputStream, java.io.DataOutputStream, java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream: comparison of binary serialization performance using various classes:

RULE1:(same for reading) It is extremely slow to write single bytes to direct byte buffers. You should avoid using direct byte buffers for writing records with mostly single byte fields.

Dec 25 '20 04:12 wuhongzhi

note: direct bytebuffer has limit by jvm, -XX:MaxDirectMemorySize default to 64Mb in same jvm, but byte[] limit to 2Gb

Dec 25 '20 09:12 wuhongzhi

👀

Dec 25 '20 18:12 xerial

Hey @wuhongzhi can you share your benchmarking code so I can run through it and confirm the performance loss? Also could you please include your JVM version?

PR #564 didn't add any new single-byte writes to a direct byte buffer - so it doesn't violate RULE 1 here. Instead, the patch allocates a new buffer around the existing data returned from SQLite via https://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-14.html#NewDirectByteBuffer. Then that data is read out of the ByteBuffer and converted to a String object in Java (NativeDB.java).

We've been running with the patch from #564 in production and our worst case read times have improved. Happy to take a look at this again if you're running into problems though.

Jan 05 '21 21:01 tildedave

Hello @tildedave, I have a test on bytes vs buffer to simulate your patch

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesVsBuffer {
	private static final byte[] data;
	static {
		data = "바이트 버퍼는 직접 또는 비 직접입니다. 직접 바이트 버퍼를 감안할 때 Java 가상 시스템은 네이티브 I/O 작업을 직접 수행하기 위해 최선을 다할 것입니다. 즉, 버퍼의 콘텐츠를 기본 운영 체제의 기본 I/O 작업 중 하나를 호출하기 전(또는 이후) 중간 버퍼에 복사하지 않도록 시도합니다. "
					.getBytes(StandardCharsets.UTF_8);
	}
	public static void main(String[] args) {
		int round = 10_000_000;
		long stime = System.nanoTime();
		for (int i = 0; i < round; i++) {
			bytesBenchmark();
		}
		long etime = System.nanoTime();
		System.out.println(String.format("new String from bytes taken %,dns", etime - stime));
		stime = System.nanoTime();
		for (int i = 0; i < round; i++) {
			bufferBenchmark();
		}
		etime = System.nanoTime();
		System.out.println(String.format("new String from buffer taken %,dns", etime - stime));
	}
	public static String bytesBenchmark() {
		return new String(Arrays.copyOf(data, data.length), StandardCharsets.UTF_8);
	}
	public static String bufferBenchmark() {
		ByteBuffer buff = ByteBuffer.allocateDirect(data.length);
		return StandardCharsets.UTF_8.decode(buff.put(data).flip()).toString();
	}
}

Results ROUND1 new String from bytes taken 6,880,143,752ns new String from buffer taken 18,389,513,317ns

ROUND2 new String from bytes taken 7,046,609,094ns new String from buffer taken 18,531,096,240ns

ROUND3 new String from bytes taken 6,802,205,509ns new String from buffer taken 19,774,656,708ns

My env is linux(ubuntu 20.4) openjdk-11

Jan 07 '21 00:01 wuhongzhi

some code extract from sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer)

    private CoderResult decodeBufferLoop(ByteBuffer src,
                                         CharBuffer dst)
    {
        int mark = src.position();
        int limit = src.limit();
        while (mark < limit) {
            int b1 = src.get();
            if (b1 >= 0) {
                // 1 byte, 7 bits: 0xxxxxxx
                if (dst.remaining() < 1)
                    return xflow(src, mark, 1); // overflow
                dst.put((char) b1);
                mark++;
            } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
                // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
                if (limit - mark < 2|| dst.remaining() < 1)
                    return xflow(src, mark, 2);
                int b2 = src.get();
                if (isNotContinuation(b2))
                    return malformedForLength(src, mark, 1);
                 dst.put((char) (((b1 << 6) ^ b2)
                                ^
                                (((byte) 0xC0 << 6) ^
                                 ((byte) 0x80 << 0))));
                mark += 2;
            } else if ((b1 >> 4) == -2) {
                // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
                int srcRemaining = limit - mark;
                if (srcRemaining < 3 || dst.remaining() < 1) {
                    if (srcRemaining > 1 && isMalformed3_2(b1, src.get()))
                        return malformedForLength(src, mark, 1);
                    return xflow(src, mark, 3);
                }
                int b2 = src.get();
                int b3 = src.get();
                if (isMalformed3(b1, b2, b3))
                    return malformed(src, mark, 3);
                char c = (char)
                    ((b1 << 12) ^
                     (b2 <<  6) ^
                     (b3 ^
                      (((byte) 0xE0 << 12) ^
                       ((byte) 0x80 <<  6) ^
                       ((byte) 0x80 <<  0))));
                if (Character.isSurrogate(c))
                    return malformedForLength(src, mark, 3);
                dst.put(c);
                mark += 3;
            } else if ((b1 >> 3) == -2) {
                // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                int srcRemaining = limit - mark;
                if (srcRemaining < 4 || dst.remaining() < 2) {
                    b1 &= 0xff;
                    if (b1 > 0xf4 ||
                        srcRemaining > 1 && isMalformed4_2(b1, src.get() & 0xff))
                        return malformedForLength(src, mark, 1);
                    if (srcRemaining > 2 && isMalformed4_3(src.get()))
                        return malformedForLength(src, mark, 2);
                    return xflow(src, mark, 4);
                }
                int b2 = src.get();
                int b3 = src.get();
                int b4 = src.get();
                int uc = ((b1 << 18) ^
                          (b2 << 12) ^
                          (b3 <<  6) ^
                          (b4 ^
                           (((byte) 0xF0 << 18) ^
                            ((byte) 0x80 << 12) ^
                            ((byte) 0x80 <<  6) ^
                            ((byte) 0x80 <<  0))));
                if (isMalformed4(b2, b3, b4) ||
                    // shortest form check
                    !Character.isSupplementaryCodePoint(uc)) {
                    return malformed(src, mark, 4);
                }
                dst.put(Character.highSurrogate(uc));
                dst.put(Character.lowSurrogate(uc));
                mark += 4;
            } else {
                return malformed(src, mark, 1);
            }
        }
        return xflow(src, mark, 0);
    }

every buffer read led to jni call java.nio.DirectByteBuffer.get()

public byte get() {
    try {
        return ((UNSAFE.getByte(ix(nextGetIndex()))));
    } finally {
        Reference.reachabilityFence(this);
    }
}

Jan 07 '21 01:01 wuhongzhi

Hi @wuhongzhi, thanks for the code.

Your byte buffer benchmark is doing three things: allocating the buffer, putting data into it, and then reading data out of it. The cost of allocating the buffer and putting data into it is why your benchmark is slower, not the decoding method.

If you remove the allocations and copies from your benchmarking code to isolate the decoding method, it ends up being much faster to read out of the ByteBuffer. What I used here was:

    private static final byte[] data;
    private static final ByteBuffer buff;

    static {
        data = "바이트 버퍼는 직접 또는 비 직접입니다. 직접 바이트 버퍼를 감안할 때 Java 가상 시스템은 네이티브 I/O 작업을 직접 수행하기 위해 최선을 다할 것입니다. 즉, 버퍼의 콘텐츠를 기본 운영 체제의 기본 I/O 작업 중 하나를 호출하기 전(또는 이후) 중간 버퍼에 복사하지 않도록 시도합니다. "
                .getBytes(StandardCharsets.UTF_8);

        buff = ByteBuffer.allocateDirect(data.length);
        buff.put(data).flip();
    }

    /* main() method unchanged */

    public static String bytesBenchmark() {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static String bufferBenchmark() {
        return StandardCharsets.UTF_8.decode(buff).toString();
    }

and I got the result (openjdk 11.0.9.1, Intel Macbook Big Sur):

dave@zwingli benchmark % java -classpath . BytesVsBuffer
new String from bytes taken 7,110,828,188ns
new String from buffer taken 165,582,666ns

OK, so, to my patch - my patch is not allocating a new byte buffer with empty data; it allocates a new byte buffer around the pointers returned from SQLite.

Are you seeing a performance decrease when you use the sqlite-jdbc library, or is this just a best practice we should be aware of when doing similar work?

Jan 09 '21 18:01 tildedave

@tildedave , TO answer your question, YES, my test is on lib, my example code about just to show you where the problem is.

according to your modification, there is an issue when bufferBenchmark() runs more then once (buff is position == limit led to empty string). After modify your the code, i run it again

static {
   ....
        buff = ByteBuffer.allocateDirect(data.length);
        buff.put(data);
}
	public static String bufferBenchmark() {
		buff.flip();
		return StandardCharsets.UTF_8.decode(buff).toString();
	}

The result is new String from bytes taken 8,291,965,852ns new String from buffer taken 9,634,591,090ns

almost equals to my first report 15% performance declined, please check that, thanks.

Jan 10 '21 13:01 wuhongzhi

Ah, makes sense that the buffer needs to be reset each test run. Thanks.

My guess from the ByteBuffer#decode code you've shared is that it's re-decoding the bytes in the buffer as if they were UTF8, but we shouldn't need to do an extra decoding pass in Java since SQLite has already put the data into UTF8 format - we really just need to put the byte data into the Java string. I'll see if I can make a patch with that. Thanks again for all the info, and sorry about the performance regression.

Jan 10 '21 21:01 tildedave

Alright, beware of Java's UTF-8 in jni does not totally equal to SQLite's UTF-8, https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

detail disciss is here https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

Jan 11 '21 00:01 wuhongzhi

What's the consensus on this ? Is this still relevant?

Jul 29 '22 06:07 gotson

We merged https://github.com/xerial/sqlite-jdbc/pull/575 which had a slight improvement in the performance over the original merge. My memory is (but I guess I didn't write in on the Issue or Pull Request, sorry) that there was still something like a 5% performance degradation over the original behavior. @wuhongzhi had some more complicated C code in that PR that might be faster.

Aug 05 '22 00:08 tildedave

sqlite-jdbc sqlite-jdbc copied to clipboard

the merge from https://github.com/xerial/sqlite-jdbc/pull/564 led to performance lost

some code extract from sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer)

every buffer read led to jni call java.nio.DirectByteBuffer.get()

sqlite-jdbc
sqlite-jdbc copied to clipboard