sqlite-jdbc icon indicating copy to clipboard operation
sqlite-jdbc copied to clipboard

the merge from https://github.com/xerial/sqlite-jdbc/pull/564 led to performance lost

Open wuhongzhi opened this issue 4 years ago β€’ 11 comments

As in in thread 564 report, nativeByteBuffer to string is a slow operation as the decoder get byte one by one via JNI. In my test in a large table(1m rows), lost 15% compare to previous version.

src for open jdk bytebuffer to string in sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer)


According to http://www.java-performance.com/ Performance of various methods of binary serialization in Java: java.nio.ByteBuffer, sun.misc.Unsafe, java.io.DataInputStream, java.io.DataOutputStream, java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream: comparison of binary serialization performance using various classes:

RULE1:(same for reading) It is extremely slow to write single bytes to direct byte buffers. You should avoid using direct byte buffers for writing records with mostly single byte fields.

wuhongzhi avatar Dec 25 '20 04:12 wuhongzhi

note: direct bytebuffer has limit by jvm, -XX:MaxDirectMemorySize default to 64Mb in same jvm, but byte[] limit to 2Gb

wuhongzhi avatar Dec 25 '20 09:12 wuhongzhi

πŸ‘€

xerial avatar Dec 25 '20 18:12 xerial

Hey @wuhongzhi can you share your benchmarking code so I can run through it and confirm the performance loss? Also could you please include your JVM version?

PR #564 didn't add any new single-byte writes to a direct byte buffer - so it doesn't violate RULE 1 here. Instead, the patch allocates a new buffer around the existing data returned from SQLite via https://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-14.html#NewDirectByteBuffer. Then that data is read out of the ByteBuffer and converted to a String object in Java (NativeDB.java).

We've been running with the patch from #564 in production and our worst case read times have improved. Happy to take a look at this again if you're running into problems though.

tildedave avatar Jan 05 '21 21:01 tildedave

Hello @tildedave, I have a test on bytes vs buffer to simulate your patch

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesVsBuffer {
	private static final byte[] data;
	static {
		data = "λ°”μ΄νŠΈ λ²„νΌλŠ” 직접 λ˜λŠ” λΉ„ μ§μ ‘μž…λ‹ˆλ‹€. 직접 λ°”μ΄νŠΈ 버퍼λ₯Ό κ°μ•ˆν•  λ•Œ Java 가상 μ‹œμŠ€ν…œμ€ λ„€μ΄ν‹°λΈŒ I/O μž‘μ—…μ„ 직접 μˆ˜ν–‰ν•˜κΈ° μœ„ν•΄ μ΅œμ„ μ„ λ‹€ν•  κ²ƒμž…λ‹ˆλ‹€. 즉, λ²„νΌμ˜ μ½˜ν…μΈ λ₯Ό κΈ°λ³Έ 운영 체제의 κΈ°λ³Έ I/O μž‘μ—… 쀑 ν•˜λ‚˜λ₯Ό ν˜ΈμΆœν•˜κΈ° μ „(λ˜λŠ” 이후) 쀑간 버퍼에 λ³΅μ‚¬ν•˜μ§€ μ•Šλ„λ‘ μ‹œλ„ν•©λ‹ˆλ‹€. "
					.getBytes(StandardCharsets.UTF_8);
	}
	public static void main(String[] args) {
		int round = 10_000_000;
		long stime = System.nanoTime();
		for (int i = 0; i < round; i++) {
			bytesBenchmark();
		}
		long etime = System.nanoTime();
		System.out.println(String.format("new String from bytes taken %,dns", etime - stime));
		stime = System.nanoTime();
		for (int i = 0; i < round; i++) {
			bufferBenchmark();
		}
		etime = System.nanoTime();
		System.out.println(String.format("new String from buffer taken %,dns", etime - stime));
	}
	public static String bytesBenchmark() {
		return new String(Arrays.copyOf(data, data.length), StandardCharsets.UTF_8);
	}
	public static String bufferBenchmark() {
		ByteBuffer buff = ByteBuffer.allocateDirect(data.length);
		return StandardCharsets.UTF_8.decode(buff.put(data).flip()).toString();
	}
}

Results ROUND1 new String from bytes taken 6,880,143,752ns new String from buffer taken 18,389,513,317ns

ROUND2 new String from bytes taken 7,046,609,094ns new String from buffer taken 18,531,096,240ns

ROUND3 new String from bytes taken 6,802,205,509ns new String from buffer taken 19,774,656,708ns

My env is linux(ubuntu 20.4) openjdk-11

wuhongzhi avatar Jan 07 '21 00:01 wuhongzhi

some code extract from sun.nio.cs.UTF_8.Decoder.decodeBufferLoop(ByteBuffer, CharBuffer)

    private CoderResult decodeBufferLoop(ByteBuffer src,
                                         CharBuffer dst)
    {
        int mark = src.position();
        int limit = src.limit();
        while (mark < limit) {
            int b1 = src.get();
            if (b1 >= 0) {
                // 1 byte, 7 bits: 0xxxxxxx
                if (dst.remaining() < 1)
                    return xflow(src, mark, 1); // overflow
                dst.put((char) b1);
                mark++;
            } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
                // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
                if (limit - mark < 2|| dst.remaining() < 1)
                    return xflow(src, mark, 2);
                int b2 = src.get();
                if (isNotContinuation(b2))
                    return malformedForLength(src, mark, 1);
                 dst.put((char) (((b1 << 6) ^ b2)
                                ^
                                (((byte) 0xC0 << 6) ^
                                 ((byte) 0x80 << 0))));
                mark += 2;
            } else if ((b1 >> 4) == -2) {
                // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
                int srcRemaining = limit - mark;
                if (srcRemaining < 3 || dst.remaining() < 1) {
                    if (srcRemaining > 1 && isMalformed3_2(b1, src.get()))
                        return malformedForLength(src, mark, 1);
                    return xflow(src, mark, 3);
                }
                int b2 = src.get();
                int b3 = src.get();
                if (isMalformed3(b1, b2, b3))
                    return malformed(src, mark, 3);
                char c = (char)
                    ((b1 << 12) ^
                     (b2 <<  6) ^
                     (b3 ^
                      (((byte) 0xE0 << 12) ^
                       ((byte) 0x80 <<  6) ^
                       ((byte) 0x80 <<  0))));
                if (Character.isSurrogate(c))
                    return malformedForLength(src, mark, 3);
                dst.put(c);
                mark += 3;
            } else if ((b1 >> 3) == -2) {
                // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                int srcRemaining = limit - mark;
                if (srcRemaining < 4 || dst.remaining() < 2) {
                    b1 &= 0xff;
                    if (b1 > 0xf4 ||
                        srcRemaining > 1 && isMalformed4_2(b1, src.get() & 0xff))
                        return malformedForLength(src, mark, 1);
                    if (srcRemaining > 2 && isMalformed4_3(src.get()))
                        return malformedForLength(src, mark, 2);
                    return xflow(src, mark, 4);
                }
                int b2 = src.get();
                int b3 = src.get();
                int b4 = src.get();
                int uc = ((b1 << 18) ^
                          (b2 << 12) ^
                          (b3 <<  6) ^
                          (b4 ^
                           (((byte) 0xF0 << 18) ^
                            ((byte) 0x80 << 12) ^
                            ((byte) 0x80 <<  6) ^
                            ((byte) 0x80 <<  0))));
                if (isMalformed4(b2, b3, b4) ||
                    // shortest form check
                    !Character.isSupplementaryCodePoint(uc)) {
                    return malformed(src, mark, 4);
                }
                dst.put(Character.highSurrogate(uc));
                dst.put(Character.lowSurrogate(uc));
                mark += 4;
            } else {
                return malformed(src, mark, 1);
            }
        }
        return xflow(src, mark, 0);
    }

every buffer read led to jni call java.nio.DirectByteBuffer.get()

public byte get() {
    try {
        return ((UNSAFE.getByte(ix(nextGetIndex()))));
    } finally {
        Reference.reachabilityFence(this);
    }
}

wuhongzhi avatar Jan 07 '21 01:01 wuhongzhi

Hi @wuhongzhi, thanks for the code.

Your byte buffer benchmark is doing three things: allocating the buffer, putting data into it, and then reading data out of it. The cost of allocating the buffer and putting data into it is why your benchmark is slower, not the decoding method.

If you remove the allocations and copies from your benchmarking code to isolate the decoding method, it ends up being much faster to read out of the ByteBuffer. What I used here was:

    private static final byte[] data;
    private static final ByteBuffer buff;

    static {
        data = "λ°”μ΄νŠΈ λ²„νΌλŠ” 직접 λ˜λŠ” λΉ„ μ§μ ‘μž…λ‹ˆλ‹€. 직접 λ°”μ΄νŠΈ 버퍼λ₯Ό κ°μ•ˆν•  λ•Œ Java 가상 μ‹œμŠ€ν…œμ€ λ„€μ΄ν‹°λΈŒ I/O μž‘μ—…μ„ 직접 μˆ˜ν–‰ν•˜κΈ° μœ„ν•΄ μ΅œμ„ μ„ λ‹€ν•  κ²ƒμž…λ‹ˆλ‹€. 즉, λ²„νΌμ˜ μ½˜ν…μΈ λ₯Ό κΈ°λ³Έ 운영 체제의 κΈ°λ³Έ I/O μž‘μ—… 쀑 ν•˜λ‚˜λ₯Ό ν˜ΈμΆœν•˜κΈ° μ „(λ˜λŠ” 이후) 쀑간 버퍼에 λ³΅μ‚¬ν•˜μ§€ μ•Šλ„λ‘ μ‹œλ„ν•©λ‹ˆλ‹€. "
                .getBytes(StandardCharsets.UTF_8);

        buff = ByteBuffer.allocateDirect(data.length);
        buff.put(data).flip();
    }

    /* main() method unchanged */

    public static String bytesBenchmark() {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static String bufferBenchmark() {
        return StandardCharsets.UTF_8.decode(buff).toString();
    }

and I got the result (openjdk 11.0.9.1, Intel Macbook Big Sur):

dave@zwingli benchmark % java -classpath . BytesVsBuffer
new String from bytes taken 7,110,828,188ns
new String from buffer taken 165,582,666ns

OK, so, to my patch - my patch is not allocating a new byte buffer with empty data; it allocates a new byte buffer around the pointers returned from SQLite.

Are you seeing a performance decrease when you use the sqlite-jdbc library, or is this just a best practice we should be aware of when doing similar work?

tildedave avatar Jan 09 '21 18:01 tildedave

@tildedave , TO answer your question, YES, my test is on lib, my example code about just to show you where the problem is.

according to your modification, there is an issue when bufferBenchmark() runs more then once (buff is position == limit led to empty string). After modify your the code, i run it again

static {
   ....
        buff = ByteBuffer.allocateDirect(data.length);
        buff.put(data);
}
	public static String bufferBenchmark() {
		buff.flip();
		return StandardCharsets.UTF_8.decode(buff).toString();
	}

The result is new String from bytes taken 8,291,965,852ns new String from buffer taken 9,634,591,090ns

almost equals to my first report 15% performance declined, please check that, thanks.

wuhongzhi avatar Jan 10 '21 13:01 wuhongzhi

Ah, makes sense that the buffer needs to be reset each test run. Thanks.

My guess from the ByteBuffer#decode code you've shared is that it's re-decoding the bytes in the buffer as if they were UTF8, but we shouldn't need to do an extra decoding pass in Java since SQLite has already put the data into UTF8 format - we really just need to put the byte data into the Java string. I'll see if I can make a patch with that. Thanks again for all the info, and sorry about the performance regression.

tildedave avatar Jan 10 '21 21:01 tildedave

Alright, beware of Java's UTF-8 in jni does not totally equal to SQLite's UTF-8, https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

detail disciss is here https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

wuhongzhi avatar Jan 11 '21 00:01 wuhongzhi

What's the consensus on this ? Is this still relevant?

gotson avatar Jul 29 '22 06:07 gotson

We merged https://github.com/xerial/sqlite-jdbc/pull/575 which had a slight improvement in the performance over the original merge. My memory is (but I guess I didn't write in on the Issue or Pull Request, sorry) that there was still something like a 5% performance degradation over the original behavior. @wuhongzhi had some more complicated C code in that PR that might be faster.

tildedave avatar Aug 05 '22 00:08 tildedave