barchart-udt
benchmarks
@CCob in case you care to look, I started putting caliper benchmarks here: https://github.com/barchart/netty-udt/tree/bench/bench. The results at http://microbenchmarks.appspot.com/user/[email protected]/ for tcp.NativeXferBench and udt.NativeXferBench show that crossing into JNI costs 10 times more for UDT than for TCP (5000 ns vs 500 ns); if you have any ideas, please let me know :-)
It could be to do with the slow start algorithm inside UDT. Have you tried delaying the timing routine until the slow start phase is over? I'm not sure it would be easy to determine that from Java; you could just transfer data for, say, 5-10 seconds, then start the actual benchmark.
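Something along these lines would do it, with sendOnce() as a placeholder for whatever one benchmark iteration actually sends (just a sketch, not tied to the caliper harness):

import java.util.concurrent.TimeUnit;

final class WarmUpThenMeasure {

    // Keep the link busy for ~10 seconds so UDT's slow start has (hopefully)
    // finished, then time only the iterations that count.
    public static void main(final String[] args) {
        final long warmUpEnd = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
        while (System.nanoTime() < warmUpEnd) {
            sendOnce(); // results ignored during warm-up
        }

        final int iterations = 100000;
        final long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sendOnce();
        }
        final long avgNanos = (System.nanoTime() - start) / iterations;
        System.out.println("avg " + avgNanos + " ns per send");
    }

    // placeholder for one UDT (or TCP) send of the benchmark payload
    static void sendOnce() {
    }
}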
good point. I will take a look.
I noticed you made a 2.2 release with some benchmark changes beforehand - was the slow start the culprit?
- no, release is driven by netty
- "slow start the culprit" is still under review
more food for thought for you: this bench https://github.com/barchart/netty-udt/blob/master/transport-udt/src/test/java/io/netty/transport/udt/bench/xfer/UdtNative.java
(results: http://microbenchmarks.appspot.com/run/[email protected]/io.netty.transport.udt.bench.xfer.UdtNative)
shows that netty does fulfill its promise and gives 20 MB/sec bandwidth with 30 ms network latency and 100K-sized messages.
I looked at latencies from 0 to 500 ms: 200 MB/sec @ 0 ms becomes 20 MB/sec @ 5 ms and stays that way till 500 ms, then starts to decline slowly.
however it raises some questions:
- how can we raise the plateau/limit above 20 MB/sec?
- how can we improve performance for small message sizes?
I have done some benchmarks of my own, and it seems there are definite performance issues. Comparing the output of the Java appclient with the equivalent C++ app from the UDT library, the CWnd on Java remains very low compared to the C++ version, and usPktSndPeriod is much higher in Java than in the C++ counterpart.
I'm looking into it further and will let you know what I find.
great. thank you for the update.
I think your original theory of crossing the JNI boundary might be correct. I have a feeling that the latency involved, especially when using the byte[] rather than the ByteBuffer JNI send function, is affecting UDT's congestion control. I'm looking through the OpenJDK sources now to see how it handles send/recv calls, but it wouldn't surprise me if HotSpot doesn't actually use JNI for those calls and instead emits inline JIT code when it sees calls to the native send/recv functions, in a similar fashion to how it handles put calls on direct ByteBuffers.
http://hg.openjdk.java.net/jdk6/jdk6-gate/jdk/file/f4bdaaa86ea8/src/windows/native/java/net/SocketOutputStream.c
Here is OpenJDK's implementation of OutputStream over a socket, which looks pretty standard to be honest, so at this point I am a little unsure why TCP performs better.
hmm... when you checked appserver+appclient, C++ vs Java, did you build them with the same options as NAR uses?
BTW I just remembered another possible performance issue: UDT is a pig and allocates 2 native threads for each socket (snd/rcv queue). Question: is there a small/easy/portable C++ thread pool lib for that?
No, the compile options were the defaults from the UDT sources.
Regarding the threads, at the moment I am only testing a 1<-->1 connection, so I don't think that is the issue, but it certainly won't scale well to hundreds of connections over UDT.
One thing I have noticed is that the default send buffer size is 64k, which means this is the maximum you can transfer in one go from Java->C++ before returning to Java. I will try a larger buffer tomorrow to see if it has an effect on performance.
Interestingly enough, I used the C++ appclient with the Java appserver and performance was the same as with the C++ appserver, so the bottleneck is in the sender, not the receiver. That also points to buffer size, since the default receive buffer on a UDT socket is in excess of 10MB, which means you transfer much more data per JNI call when reading from a socket than when writing to it.
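The change I have in mind is roughly the following; the UDT_SNDBUF / UDP_SNDBUF option names are assumed from the native UDT API and I have not verified that OptionUDT exposes them:

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;

final class SendBufferTuning {

    // Enlarge the UDT and UDP send buffers before connecting, so a single
    // JNI hand-off can carry more than the 64k default.
    // UDT_SNDBUF / UDP_SNDBUF are assumed constant names mirroring native UDT.
    static void enlargeSendBuffers(final SocketUDT socket) throws Exception {
        socket.setOption(OptionUDT.UDT_SNDBUF, 8 * 1024 * 1024); // UDT-level send buffer
        socket.setOption(OptionUDT.UDP_SNDBUF, 8 * 1024 * 1024); // OS-level UDP send buffer
    }
}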
great info; I will also look into this.
I have found the cause of the performance issue I was seeing on Windows. It's related to whether Windows uses its Fast IO path, which depends on the datagram packet size configured in the registry key HKLM\System\CurrentControlSet\Services\Afd\Parameters\FastSendDatagramThreshold.
UDT defaults its packet size to 1500, which is above that threshold, so sends skip the Fast IO path and get buffered inside Windows, and that buffering confuses UDT's congestion control.
By changing the MSS option on the UDT socket to a smaller value I get much better performance.
socket.socketUDT().setOption(OptionUDT.UDT_MSS, 1052);
The big clue was in the appclient.cpp from the UDT code.
With the above in place, on a local LAN I can get 400Mb/s using the Java appclient and 460Mb/s using the native C++ counterpart.
Analyzing the CPU, both cases max out 100% of a single core due to the single SndQueue thread managing the congestion control and packet sending.
So it seems there is around a 15% decrease in performance when using Java, and I imagine this is down to the JNI layer transitions and the conversion from byte[] and ByteBuffer to char* in C++. Smaller, more frequent transfers across the JNI layer are likely to cause more issues than larger application-level buffers, but I have not confirmed this to be the case.
So at this stage I am happy with a 15% decrease vs UDT's native counterpart, and I can't see how we can improve on it with the current methodology of utilizing UDT at the C++ layer.
re: "OptionUDT.UDT_MSS, 1052" silly me - I actually already run into this but then forgot! :-)
should we set "OptionUDT.UDT_MSS, 1052" by default when we detect Windows?
re: "Analyzing the CPU, in both cases it maxes 100% CPU" what do you use to profile c++?
re: "at this stage I am happy with a 15% decrease" - did you try direct buffers instead of arrays? there is no copy involved with direct buffers.
re: "on a local LAN I can get 400Mb/s" did you try to introduce delays? see TrafficControl.java
Hmm, good question. You could get further performance out of Windows by adjusting FastSendDatagramThreshold, which gives you the best of both worlds. But if you force the MSS to 1052 you won't see any benefit from setting FastSendDatagramThreshold to something higher.
Perhaps the default should be to set it to 1052, with a property you can set to override this behavior for users who have updated the FastSendDatagramThreshold value.
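A rough sketch of what that could look like; the property name "udt.windows.mss" is just a placeholder for illustration, not an existing setting, and only the UDT_MSS option call is confirmed in this thread:

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;

final class WindowsMssDefault {

    // On Windows, keep the UDT MSS below FastSendDatagramThreshold so sends
    // take the Fast IO path, unless the user overrides it.
    // "udt.windows.mss" is a hypothetical property name used for illustration.
    static void applyDefaultMss(final SocketUDT socket) throws Exception {
        final String os = System.getProperty("os.name", "");
        if (os.toLowerCase().startsWith("windows")) {
            final int mss = Integer.getInteger("udt.windows.mss", 1052);
            socket.setOption(OptionUDT.UDT_MSS, mss);
        }
    }
}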
"what do you use to profile c++" - in this particular case I simply used performance monitor on Windows. But any analysis for specific hot spots I use AMD CodeAnalyst.
"did you try direct buffers instead of arrays" - Yes, I updated the appclient to use direct ByteBuffers instead of byte[], in fact I had already done that before finding the MSS issue, so it maybe worse with byte[].
"did you try to introduce delays" - No, this was a simple 1<---->1 connection.
Do you know of any Windows equivalent to 'tc' so that I can test the C++/Java appclient with latency introduced?
I'll give WANem a try since it's a LiveCD and no installation is necessary.
wanem is probably the easiest to get started with. This is probably more current, or try cisco nist, or make your own linux box with netem, or mess with msvc newt.
got an answer from Yunhong Gu re: "udt is a pig and allocates 2 native threads for each socket":
You can share these sockets on the same port, unless you have a reason not to.
Two threads are created for each UDP port you open, not for each UDT socket.
Thus, you can run 110 sockets with only two threads.
Create socket with UDT_REUSEADDR option, then explicitly bind sockets on the same port.
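In barchart-udt terms that would look roughly like the sketch below; the OptionUDT.UDT_REUSEADDR constant and the exact SocketUDT constructor/bind signatures are assumptions I have not verified:

import java.net.InetSocketAddress;

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;
import com.barchart.udt.TypeUDT;

final class SharedUdpPort {

    // Per the answer above: the two worker threads belong to the UDP port,
    // not to the UDT socket, so binding many sockets to one port keeps the
    // native thread count at two.
    // OptionUDT.UDT_REUSEADDR is an assumed constant name; only UDT_MSS is
    // confirmed elsewhere in this thread.
    static SocketUDT newSharedSocket(final int port) throws Exception {
        final SocketUDT socket = new SocketUDT(TypeUDT.STREAM);
        socket.setOption(OptionUDT.UDT_REUSEADDR, true); // reuse before bind
        socket.bind(new InetSocketAddress(port));        // explicit bind to the shared port
        return socket;
    }
}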
So does that ring true for server sockets too? When you accept, does it automatically create a new UDP binding on a different UDP port, or does it bind to the same port the server is listening on?
I don't have concrete numbers yet, but using a VM running WANulator, the C++ and Java appclients seem to perform roughly the same once latency is introduced. Below are tests where WANulator added latency of around 180 ms, which is roughly what I get for pings to servers in LA from the UK. First is the C++ appclient, followed by the Java appclient.
C++ appclient:
SendRate(Mb/s) RTT(ms) CWnd PktSndPeriod(us) RecvACK RecvNAK WT(us)
1.49745 107.165 140 1 5 0 10472
49.3417 148.746 4843 68 12 15 126456
71.8588 164.734 9034 111 5 4 26502
84.2646 178.534 16821 101.5 127 0 7780
92.2888 178.659 16155 93 123 0 8888
89.2824 178.448 17440 97.5 101 1 8971
85.266 178.547 14915 102 105 1 9431
97.3497 178.815 16057 74 119 0 8824
88.4056 178.791 14981 98.5 97 3 9121
70.3212 178.421 12994 126.5 128 5 10943
75.306 178.57 13172 112.5 168 0 10939
83.1319 179.047 14637 102.5 133 0 9876
98.7295 178.668 16231 69 123 0 8772
67.887 180.442 15589 101 29 41 7070
46.8308 180.849 13801 144 9 5 21608
67.2912 178.166 12806 123.5 164 0 9854
76.8257 178.373 14696 110.5 155 0 10763
84.9564 178.582 14610 100 131 0 9724
92.6469 179.613 13664 106 94 3 8974
87.9044 178.706 15502 97 113 0 9140
106.947 179.072 17671 63 125 0 8198
120.859 179.082 18926 77.5 108 7 6596
120.292 179.815 15440 71.5 149 0 6856
58.1574 180.32 13998 103.5 3 5 10724
69.4597 179.003 16476 112 70 1 16540
83.5923 178.548 14661 102 129 0 9760
121.404 179.253 19054 64 124 1 7612
108.217 179.051 12081 87 106 5 6836
91.3212 179.24 10325 106 88 3 8781
72.6415 180.456 7750 143.5 91 10 10699
67.1196 179.62 11208 124 158 0 12297
81.2635 179.056 17762 84.5 153 0 10558
Java appclient:
SendRate(Mb/s) RTT(ms) CWnd PktSndPeriod(us) RecvACK RecvNAK WT(us)
1.328 126.271 139 1.00 3 0 576
58.438 166.433 5688 98.00 10 27 870019
88.962 179.502 13754 130.00 94 3 47683
50.470 179.859 11765 167.00 71 4 70412
55.565 179.745 12501 144.00 102 0 77511
64.542 179.468 13428 111.00 102 0 68313
103.567 179.973 16219 94.00 76 6 45019
95.876 179.650 17046 86.00 102 0 43076
72.805 179.304 13858 114.00 83 3 54406
76.009 179.207 14465 119.00 88 1 54885
75.937 179.206 14406 122.00 99 1 56273
73.538 179.177 16069 125.00 87 1 57099
72.283 179.612 15761 113.00 89 0 58004
101.837 180.181 13916 73.00 102 4 47394
107.397 180.317 15103 79.00 68 7 39170
112.882 180.095 16755 84.00 103 1 37799
94.069 179.254 18006 88.00 86 3 43508
101.700 179.876 15258 151.00 103 5 41793
59.556 179.229 14828 135.00 81 3 63533
67.658 179.133 13606 120.00 98 0 63461
75.546 178.727 14843 108.00 106 0 56936
83.834 179.274 15462 98.00 102 0 51374
80.561 179.166 15729 103.00 87 2 52059
95.743 179.098 17033 69.00 103 0 47579
128.168 181.139 13627 86.00 66 32 33337
81.960 179.263 15143 102.00 82 2 49742
88.537 178.824 16469 93.00 104 0 48361
85.842 179.647 17032 110.00 84 2 48394
81.238 179.262 17335 101.00 86 0 51383
98.530 179.470 17658 66.00 103 0 46838
69.402 179.635 17015 114.00 13 15 47876
64.303 181.080 16738 110.00 32 0 91858
81.962 179.398 16997 100.00 101 0 52239
89.989 179.099 16303 91.00 105 0 47614
88.076 179.121 15952 95.00 85 1 47495
92.841 179.832 15597 95.00 89 4 45191
76.913 179.314 14689 112.00 87 2 53004
80.979 179.620 16529 101.00 104 0 52854
The switching capability of the VM is pretty poor on a laptop in comparison to some real server hardware, so on Monday I will try with some real hardware to see where we go.
re: "had latency of around 180" - lets agree on common latency ladder for benchmarks?
re: "SendRate(Mb/s)" - is it bytes or bits per second?
re: "using a VM running WANulator" - yes, I think using vm is no good for matchmaking.
how about increments of 100ms from 0 to 500ms?
the rate is in bits; poor laptop and VM, what can I say.
re: "true for server sockets too?" answer:
The accept() socket reuses the same port as the listen() socket.
UDT_REUSEADDR applies to the rendezvous socket too.
Got an update for you on a real machine running WANulator.
- Running iperf (TCP performance app) I get around 650Mb/s bandwidth (poor desktop switch, so cannot reach the full 1Gb/s)
- Running both the C++ and Java appclients without latency, they behave the same with roughly 400Mb/s bandwidth (with the 1052 MSS limit due to Windows), a little less on the Java version
- Adding 500ms latency, iperf performance drops to 1.14Mb/s (shocking)
- With the 500ms latency both Java and C++ versions of appclient cap out at 127-130Mb/s
I also tried at 400ms, 300ms, 200ms and 100ms, with bandwidth increasing at each step down. By the time I come down to the 100ms mark it is already close to the max rate for my setup at 0ms, so there is little change between 0-150ms. So I'm not quite sure how you are seeing such a bad performance drop with 20ms latency at present.