barchart-udt
benchmarks
@CCob in case you care to look, I started putting caliper benchmarks here: https://github.com/barchart/netty-udt/tree/bench/bench. The results at http://microbenchmarks.appspot.com/user/[email protected]/ for tcp.NativeXferBench and udt.NativeXferBench show that crossing into JNI costs 10 times more for UDT than for TCP (5000 ns vs 500 ns); if you have any ideas, please let me know :-)
It could be to do with the slow start algorithm inside UDT. Have you tried delaying the timing routine until the slow start phase is over? I'm not sure it would be easy to determine that from Java; you could just transfer data for, say, 5-10 seconds, then start the actual benchmark.
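Something along these lines would do it, with sendOnce() as a placeholder for whatever one benchmark iteration actually sends (just a sketch, not tied to the caliper harness):

import java.util.concurrent.TimeUnit;

final class WarmUpThenMeasure {

    // Keep the link busy for ~10 seconds so UDT's slow start has (hopefully)
    // finished, then time only the iterations that count.
    public static void main(final String[] args) {
        final long warmUpEnd = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
        while (System.nanoTime() < warmUpEnd) {
            sendOnce(); // results ignored during warm-up
        }

        final int iterations = 100000;
        final long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sendOnce();
        }
        final long avgNanos = (System.nanoTime() - start) / iterations;
        System.out.println("avg " + avgNanos + " ns per send");
    }

    // placeholder for one UDT (or TCP) send of the benchmark payload
    static void sendOnce() {
    }
}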
good point. I will take a look.
I noticed you made a 2.2 release with some benchmark changes beforehand - was the slow start the culprit?
- no, release is driven by netty
- "slow start the culprit" is still under review
more food for thought for you: this bench https://github.com/barchart/netty-udt/blob/master/transport-udt/src/test/java/io/netty/transport/udt/bench/xfer/UdtNative.java
(results: http://microbenchmarks.appspot.com/run/[email protected]/io.netty.transport.udt.bench.xfer.UdtNative)
shows that netty does fulfill its promise and gives 20 MB/sec bandwidth with 30 ms network latency and 100K-sized messages.
I looked at latencies from 0 to 500 ms: 200 MB/sec @ 0 ms becomes 20 MB/sec @ 5 ms and stays that way till 500 ms, then starts to decline slowly.
however it raises some questions:
- how can we raise the plateau/limit above 20 MB/sec?
- how can we improve performance for small message sizes?
I have done some benchmarks of my own, and it seems there are definite performance issues. Comparing the output of the Java appclient with the equivalent C++ app from the UDT library, the CWnd on Java remains very low compared to the C++ version, and usPktSndPeriod is much higher in Java than in the C++ counterpart.
I'm looking into it further and will let you know what I find.
great. thank you for the update.
I think your original theory of crossing the JNI boundary might be correct. I have a feeling that the latency involved, especially when using the byte[] rather than the ByteBuffer JNI send function, is affecting UDT's congestion control. I'm looking through the OpenJDK sources now to see how it handles send/recv calls, but it wouldn't surprise me if HotSpot doesn't actually use JNI for those calls and instead emits inline JIT code when it sees calls to the native send/recv functions, in a similar fashion to how it handles put calls on direct ByteBuffers.
http://hg.openjdk.java.net/jdk6/jdk6-gate/jdk/file/f4bdaaa86ea8/src/windows/native/java/net/SocketOutputStream.c
Here is OpenJDK's implementation of OutputStream over a socket, which looks pretty standard to be honest, so at this point I am a little unsure why TCP performs better.
hmm... when you checked appserver+appclient, C++ vs Java, did you build them with the same options as NAR uses?
BTW I just remembered another possible performance issue: UDT is a pig and allocates 2 native threads for each socket (snd/rcv queue). Question: is there a small/easy/portable C++ thread pool lib for that?
No, the compile options were the defaults from the UDT sources.
Regarding the threads, at the moment I am only testing a 1<-->1 connection, so I don't think that is the issue, but it certainly won't scale well to hundreds of connections over UDT.
One thing I have noticed is that the default send buffer size is 64k, which means this is the maximum you can transfer in one go from Java->C++ before returning to Java. I will try a larger buffer tomorrow to see if it has an effect on performance.
Interestingly enough, I used the C++ appclient with the Java appserver and performance was the same as with the C++ appserver, so the bottleneck is in the sender, not the receiver. That also points to buffer size, since the default receive buffer on a UDT socket is in excess of 10MB, which means you transfer much more data per JNI call when reading from a socket than when writing to it.
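The change I have in mind is roughly the following; the UDT_SNDBUF / UDP_SNDBUF option names are assumed from the native UDT API and I have not verified that OptionUDT exposes them:

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;

final class SendBufferTuning {

    // Enlarge the UDT and UDP send buffers before connecting, so a single
    // JNI hand-off can carry more than the 64k default.
    // UDT_SNDBUF / UDP_SNDBUF are assumed constant names mirroring native UDT.
    static void enlargeSendBuffers(final SocketUDT socket) throws Exception {
        socket.setOption(OptionUDT.UDT_SNDBUF, 8 * 1024 * 1024); // UDT-level send buffer
        socket.setOption(OptionUDT.UDP_SNDBUF, 8 * 1024 * 1024); // OS-level UDP send buffer
    }
}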
great info; I will also look into this.
I have found the cause of the performance issue I was seeing on Windows. It's related to whether Windows uses its Fast IO path, which depends on the datagram packet size configured in the registry key HKLM\System\CurrentControlSet\Services\Afd\Parameters\FastSendDatagramThreshold.
UDT defaults its packet size to 1500, which is above that threshold, so sends skip the Fast IO path and get buffered inside Windows, and that buffering confuses UDT's congestion control.
By changing the MSS option on the UDT socket to a smaller value I get much better performance.
socket.socketUDT().setOption(OptionUDT.UDT_MSS, 1052);
The big clue was in the appclient.cpp from the UDT code.
With the above in place, on a local LAN I can get 400Mb/s using the Java appclient and 460Mb/s using the native C++ counterpart.
Analyzing the CPU, both cases max out 100% of a single core due to the single SndQueue thread managing the congestion control and packet sending.
So it seems there is around a 15% decrease in performance when using Java, and I imagine this is down to the JNI layer transitions and the conversion from byte[] and ByteBuffer to char* in C++. Smaller, more frequent transfers across the JNI layer are likely to cause more issues than larger application-level buffers, but I have not confirmed this to be the case.
So at this stage I am happy with a 15% decrease vs UDT's native counterpart, and I can't see how we can improve on it with the current methodology of utilizing UDT at the C++ layer.
re: "OptionUDT.UDT_MSS, 1052" silly me - I actually already run into this but then forgot! :-)
should we set "OptionUDT.UDT_MSS, 1052" by default when we detect Windows?
re: "Analyzing the CPU, in both cases it maxes 100% CPU" what do you use to profile c++?
re: "at this stage I am happy with a 15% decrease" - did you try direct buffers instead of arrays? there is no copy involved with direct buffers.
re: "on a local LAN I can get 400Mb/s" did you try to introduce delays? see TrafficControl.java
Hmm, good question. You could get further performance out of Windows by adjusting FastSendDatagramThreshold, which gives you the best of both worlds. But if you force the MSS to 1052 you won't see any benefit from setting FastSendDatagramThreshold to something higher.
Perhaps the default should be to set it to 1052, with a property you can set to override this behavior for users who have updated the FastSendDatagramThreshold value.
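A rough sketch of what that could look like; the property name "udt.windows.mss" is just a placeholder for illustration, not an existing setting, and only the UDT_MSS option call is confirmed in this thread:

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;

final class WindowsMssDefault {

    // On Windows, keep the UDT MSS below FastSendDatagramThreshold so sends
    // take the Fast IO path, unless the user overrides it.
    // "udt.windows.mss" is a hypothetical property name used for illustration.
    static void applyDefaultMss(final SocketUDT socket) throws Exception {
        final String os = System.getProperty("os.name", "");
        if (os.toLowerCase().startsWith("windows")) {
            final int mss = Integer.getInteger("udt.windows.mss", 1052);
            socket.setOption(OptionUDT.UDT_MSS, mss);
        }
    }
}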
"what do you use to profile c++" - in this particular case I simply used performance monitor on Windows. But any analysis for specific hot spots I use AMD CodeAnalyst.
"did you try direct buffers instead of arrays" - Yes, I updated the appclient to use direct ByteBuffers instead of byte[], in fact I had already done that before finding the MSS issue, so it maybe worse with byte[].
"did you try to introduce delays" - No, this was a simple 1<---->1 connection.
Do you know of any Windows equivalent to 'tc' so that I can test the C++/Java appclient with latency introduced?
I'll give WANem a try since it's a LiveCD and no installation is necessary.
wanem is probably the easiest to get started with. This is probably more current, or try cisco nist, or make your own linux box with netem, or mess with msvc newt.
got an answer from Yunhong Gu re: "udt is a pig and allocates 2 native threads for each socket":
You can share these sockets on the same port, unless you have a reason not to.
Two threads are created for each UDP port you open, not for each UDT socket.
Thus, you can run 110 sockets with only two threads.
Create socket with UDT_REUSEADDR option, then explicitly bind sockets on the same port.
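In barchart-udt terms that would look roughly like the sketch below; the OptionUDT.UDT_REUSEADDR constant and the exact SocketUDT constructor/bind signatures are assumptions I have not verified:

import java.net.InetSocketAddress;

import com.barchart.udt.OptionUDT;
import com.barchart.udt.SocketUDT;
import com.barchart.udt.TypeUDT;

final class SharedUdpPort {

    // Per the answer above: the two worker threads belong to the UDP port,
    // not to the UDT socket, so binding many sockets to one port keeps the
    // native thread count at two.
    // OptionUDT.UDT_REUSEADDR is an assumed constant name; only UDT_MSS is
    // confirmed elsewhere in this thread.
    static SocketUDT newSharedSocket(final int port) throws Exception {
        final SocketUDT socket = new SocketUDT(TypeUDT.STREAM);
        socket.setOption(OptionUDT.UDT_REUSEADDR, true); // reuse before bind
        socket.bind(new InetSocketAddress(port));        // explicit bind to the shared port
        return socket;
    }
}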
So does that ring true for server sockets too? When you accept, does it automatically create a new UDP binding on a different UDP port, or does it bind to the same port the server is listening on?
I don't have concrete numbers yet, but using a VM running WANulator, the C++ and Java appclients seem to perform roughly the same once latency is introduced. Below are tests where WANulator added latency of around 180 ms, which is roughly what I get for pings to servers in LA from the UK. First is the C++ appclient, followed by the Java appclient.
C++ appclient:
SendRate(Mb/s) RTT(ms) CWnd PktSndPeriod(us) RecvACK RecvNAK WT(us)
1.49745 107.165 140 1 5 0 10472
49.3417 148.746 4843 68 12 15 126456
71.8588 164.734 9034 111 5 4 26502
84.2646 178.534 16821 101.5 127 0 7780
92.2888 178.659 16155 93 123 0 8888
89.2824 178.448 17440 97.5 101 1 8971
85.266 178.547 14915 102 105 1 9431
97.3497 178.815 16057 74 119 0 8824
88.4056 178.791 14981 98.5 97 3 9121
70.3212 178.421 12994 126.5 128 5 10943
75.306 178.57 13172 112.5 168 0 10939
83.1319 179.047 14637 102.5 133 0 9876
98.7295 178.668 16231 69 123 0 8772
67.887 180.442 15589 101 29 41 7070
46.8308 180.849 13801 144 9 5 21608
67.2912 178.166 12806 123.5 164 0 9854
76.8257 178.373 14696 110.5 155 0 10763
84.9564 178.582 14610 100 131 0 9724
92.6469 179.613 13664 106 94 3 8974
87.9044 178.706 15502 97 113 0 9140
106.947 179.072 17671 63 125 0 8198
120.859 179.082 18926 77.5 108 7 6596
120.292 179.815 15440 71.5 149 0 6856
58.1574 180.32 13998 103.5 3 5 10724
69.4597 179.003 16476 112 70 1 16540
83.5923 178.548 14661 102 129 0 9760
121.404 179.253 19054 64 124 1 7612
108.217 179.051 12081 87 106 5 6836
91.3212 179.24 10325 106 88 3 8781
72.6415 180.456 7750 143.5 91 10 10699
67.1196 179.62 11208 124 158 0 12297
81.2635 179.056 17762 84.5 153 0 10558
Java appclient:
SendRate(Mb/s) RTT(ms) CWnd PktSndPeriod(us) RecvACK RecvNAK WT(us)
1.328 126.271 139 1.00 3 0 576
58.438 166.433 5688 98.00 10 27 870019
88.962 179.502 13754 130.00 94 3 47683
50.470 179.859 11765 167.00 71 4 70412
55.565 179.745 12501 144.00 102 0 77511
64.542 179.468 13428 111.00 102 0 68313
103.567 179.973 16219 94.00 76 6 45019
95.876 179.650 17046 86.00 102 0 43076
72.805 179.304 13858 114.00 83 3 54406
76.009 179.207 14465 119.00 88 1 54885
75.937 179.206 14406 122.00 99 1 56273
73.538 179.177 16069 125.00 87 1 57099
72.283 179.612 15761 113.00 89 0 58004
101.837 180.181 13916 73.00 102 4 47394
107.397 180.317 15103 79.00 68 7 39170
112.882 180.095 16755 84.00 103 1 37799
94.069 179.254 18006 88.00 86 3 43508
101.700 179.876 15258 151.00 103 5 41793
59.556 179.229 14828 135.00 81 3 63533
67.658 179.133 13606 120.00 98 0 63461
75.546 178.727 14843 108.00 106 0 56936
83.834 179.274 15462 98.00 102 0 51374
80.561 179.166 15729 103.00 87 2 52059
95.743 179.098 17033 69.00 103 0 47579
128.168 181.139 13627 86.00 66 32 33337
81.960 179.263 15143 102.00 82 2 49742
88.537 178.824 16469 93.00 104 0 48361
85.842 179.647 17032 110.00 84 2 48394
81.238 179.262 17335 101.00 86 0 51383
98.530 179.470 17658 66.00 103 0 46838
69.402 179.635 17015 114.00 13 15 47876
64.303 181.080 16738 110.00 32 0 91858
81.962 179.398 16997 100.00 101 0 52239
89.989 179.099 16303 91.00 105 0 47614
88.076 179.121 15952 95.00 85 1 47495
92.841 179.832 15597 95.00 89 4 45191
76.913 179.314 14689 112.00 87 2 53004
80.979 179.620 16529 101.00 104 0 52854
The switching capability of the VM is pretty poor on a laptop in comparison to some real server hardware, so on Monday I will try with some real hardware to see where we go.
re: "had latency of around 180" - lets agree on common latency ladder for benchmarks?
re: "SendRate(Mb/s)" - is it bytes or bits per second?
re: "using a VM running WANulator" - yes, I think using vm is no good for matchmaking.
how about increments of 100ms from 0 to 500ms?
the rate is in bits; poor laptop and VM, what can I say.
re: "true for server sockets too?" answer:
The accept() socket reuses the same port as the listen() socket.
UDT_REUSEADDR applies to the rendezvous socket too.
Got an update for you on a real machine running WANulator.
- Running iperf (TCP performance app) I get around 650Mb/s bandwidth (poor desktop switch, so cannot reach the full 1Gb/s)
- Running both the C++ and Java appclients without latency, they behave the same with roughly 400Mb/s bandwidth (with the 1052 MSS limit due to Windows), a little less on the Java version
- Adding 500ms latency, iperf performance drops to 1.14Mb/s (shocking)
- With the 500ms latency both Java and C++ versions of appclient cap out at 127-130Mb/s
I also tried at 400ms, 300ms, 200ms and 100ms, with bandwidth increasing at each step down. By the time I come down to the 100ms mark it is already close to the max rate for my setup at 0ms, so there is little change between 0-150ms. So I'm not quite sure how you are seeing such a bad performance drop with 20ms latency at present.