
Performance comparison between dds and dds-cxx throughput (dds-cxx is much worse)

Open YeahhhhLi opened this issue 2 years ago • 7 comments

Ubuntu 20.04.1, dds version 0.9.1, dds-cxx version 0.9.1. Tested on a single host; run this command in one terminal:

./bin/ThroughputPublisher

And in the other terminal:

./bin/ThroughputSubscriber 

Using the throughput tool that comes with dds, the test results are as follows:

Cycles: 0 | PollingDelay: -1 | Partition: Throughput example
=== [Subscriber] Waiting for samples...
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 201714 samples, 1654054800 bytes | Out of order: 0 samples Transfer rate: 201570.69 samples/s, 13222.60 Mbit/s
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 383084 samples, 3141288800 bytes | Out of order: 0 samples Transfer rate: 181228.85 samples/s, 11888.07 Mbit/s
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 555829 samples, 4557797800 bytes | Out of order: 0 samples Transfer rate: 172634.72 samples/s, 11324.03 Mbit/s
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 728681 samples, 5975184200 bytes | Out of order: 0 samples Transfer rate: 172673.59 samples/s, 11326.56 Mbit/s
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 891710 samples, 7312022000 bytes | Out of order: 0 samples Transfer rate: 162908.28 samples/s, 10686.08 Mbit/s
=== [Subscriber] 1.001 Payload size: 8192 | Total received: 1044720 samples, 8566704000 bytes | Out of order: 0 samples Transfer rate: 152891.31 samples/s, 10029.48 Mbit/s

Since there is no throughput example for cxx, I hand-wrote one based on the dds example and the API provided by cxx, and tested it:

[Subscriber] Create reader.
[Subscriber] Wait for message.
[Subscriber] Interval[1.00 s] Samples[54829.58 counts] MsgSize[8192.00 bytes] Speed[3426.85 Mbits/s]
[Subscriber] Interval[1.00 s] Samples[54337.14 counts] MsgSize[8192.00 bytes] Speed[3396.07 Mbits/s]
[Subscriber] Interval[1.02 s] Samples[49566.60 counts] MsgSize[8192.00 bytes] Speed[3097.91 Mbits/s]
[Subscriber] Interval[1.01 s] Samples[59811.39 counts] MsgSize[8192.00 bytes] Speed[3738.21 Mbits/s]
[Subscriber] Interval[1.00 s] Samples[51487.54 counts] MsgSize[8192.00 bytes] Speed[3217.97 Mbits/s]
[Subscriber] Interval[1.01 s] Samples[61694.57 counts] MsgSize[8192.00 bytes] Speed[3855.91 Mbits/s]
[Subscriber] Interval[1.01 s] Samples[60861.33 counts] MsgSize[8192.00 bytes] Speed[3803.83 Mbits/s]
[Subscriber] Interval[1.01 s] Samples[64443.52 counts] MsgSize[8192.00 bytes] Speed[4027.72 Mbits/s]
[Subscriber] Interval[1.00 s] Samples[62247.15 counts] MsgSize[8192.00 bytes] Speed[3890.45 Mbits/s]

According to the results shown by the subscriber, the cxx version is 3-4 times slower than the C version.

How can this be solved? I can provide the cxx throughput source code if needed.
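For reference, a minimal sketch of what such a hand-written cxx throughput subscriber can look like, using the ISO C++ API that cyclonedds-cxx provides. The IDL type (ThroughputModule::DataType with a payload sequence), topic name, and QoS below are assumptions for illustration; they are not the exact code behind the numbers above.

// Minimal cyclonedds-cxx throughput subscriber sketch.
// "Throughput.hpp" / ThroughputModule::DataType are assumed to come from an
// IDL file with a sequence<octet> payload field; adjust to the real type.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

#include <dds/dds.hpp>
#include "Throughput.hpp"

int main()
{
  dds::domain::DomainParticipant participant(0);
  dds::topic::Topic<ThroughputModule::DataType> topic(participant, "Throughput");
  dds::sub::Subscriber subscriber(participant);

  // Reliable + keep-all so no samples are silently dropped under load.
  dds::sub::qos::DataReaderQos rqos = subscriber.default_datareader_qos();
  rqos << dds::core::policy::Reliability::Reliable()
       << dds::core::policy::History::KeepAll();
  dds::sub::DataReader<ThroughputModule::DataType> reader(subscriber, topic, rqos);

  std::cout << "[Subscriber] Create reader." << std::endl;
  uint64_t samples_in_interval = 0;
  uint64_t bytes_in_interval = 0;
  auto interval_start = std::chrono::steady_clock::now();

  while (true)
  {
    // take() drains whatever is currently in the reader history cache.
    auto samples = reader.take();
    for (const auto &s : samples)
    {
      if (!s.info().valid())
        continue;
      ++samples_in_interval;
      bytes_in_interval += s.data().payload().size();
    }

    auto now = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = now - interval_start;
    if (elapsed.count() >= 1.0)
    {
      double mbits = (bytes_in_interval * 8.0) / elapsed.count() / 1e6;
      std::cout << "[Subscriber] Interval[" << elapsed.count() << " s] "
                << "Samples[" << samples_in_interval << "] "
                << "Speed[" << mbits << " Mbits/s]" << std::endl;
      samples_in_interval = 0;
      bytes_in_interval = 0;
      interval_start = now;
    }
    // Simple polling loop; a waitset or listener could be used instead.
    std::this_thread::sleep_for(std::chrono::microseconds(100));
  }
}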

YeahhhhLi avatar Dec 05 '22 09:12 YeahhhhLi

I suspect it is related to batching in C++, but I didn't manage to solve it.

wjbbupt avatar Dec 08 '22 08:12 wjbbupt

@YeahhhhLi @wjbbupt The "batching" has a very big effect for really small samples, because only then does the overhead of packet processing become dominant. A very unscientific experiment using ddsperf on my laptop (x1000 samples/s):

 size  batch   no-batch
    0    362       2740
  100    360       2565
 1000    328       1295
10000    230        230

So for the 8k case, I think it stands to reason that batching wouldn't solve it.
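For context, the batching being compared here is toggled at the C level with dds_write_set_batch() and flushed explicitly with dds_write_flush(); cyclonedds-cxx does not expose a binding for it (see the following comments). A minimal C sketch of the mechanism, where the IDL type, topic name, and flush interval are placeholders, not taken from this thread:

/* Sketch only: enabling Cyclone DDS write batching via the C API.
 * ThroughputModule_DataType(_desc) stand in for a hypothetical IDL-generated type. */
#include <stdint.h>
#include <dds/dds.h>
#include "Throughput.h"

int main (void)
{
  dds_entity_t participant = dds_create_participant (DDS_DOMAIN_DEFAULT, NULL, NULL);
  dds_entity_t topic = dds_create_topic (participant, &ThroughputModule_DataType_desc,
                                         "Throughput", NULL, NULL);
  dds_entity_t writer = dds_create_writer (participant, topic, NULL, NULL);

  /* Global switch: samples are accumulated into larger RTPS messages
     instead of being sent one packet per write. */
  dds_write_set_batch (true);

  ThroughputModule_DataType sample = { 0 };
  for (uint32_t i = 0; i < 1000000; i++)
  {
    dds_write (writer, &sample);
    if ((i % 64) == 0)
      dds_write_flush (writer);  /* push the accumulated batch onto the wire */
  }
  dds_write_flush (writer);
  dds_delete (participant);
  return 0;
}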

Could you perhaps make a FlameGraph so we can see where the time goes?
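A typical recipe for that, assuming perf is installed and Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) are in the current directory, would be something like:

perf record -F 99 -g -p $(pidof ThroughputSubscriber) -- sleep 30   # or the PID of the hand-written cxx subscriber
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > throughput.svg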

eboasson avatar Dec 08 '22 14:12 eboasson

@eboasson I want to integrate batching in C++, but I see that there is no binding for it in C++. How do you do it here?

wjbbupt avatar Dec 09 '22 01:12 wjbbupt

@YeahhhhLi, there is currently a PR outstanding for C++ equivalents of the CycloneDDS-C performance/example programs roundtrip (latency) and throughput: PR 325

Could you give those a try and tell us what kind of results you get then? (and review comments would also be very appreciated, of course)

reicheratwork avatar Dec 13 '22 13:12 reicheratwork

@wjbbupt Sorry, please ignore the deleted post, as you already have an issue open on it here

reicheratwork avatar Dec 13 '22 15:12 reicheratwork

I tried the code in the link you posted a while ago. The result is that without batching it performs worse than the C version, and after adding batching the performance does not change much. Theoretically, the performance gap between C and C++ should not be very large, but the measured gap is 2-4 times (samples/s). That is why I asked why the gap between C and C++ is so large. You can try it yourself to see what I mean. So your suggestion didn't solve the problem, but thanks for your answer anyway.

wjbbupt avatar Dec 13 '22 16:12 wjbbupt

Thanks for the reply, I'll try it later.

YeahhhhLi avatar Dec 15 '22 06:12 YeahhhhLi