cyclonedds-cxx

c++ performance vs c performance

Open wjbbupt opened this issue 2 years ago • 20 comments

Experiments have found that C++ performance is 2-3 times slower than C. From the flame graph we can see that on the pub side the sequence is serialized to CDR, which consumes a lot of time. On the sub side, deserialization and take are faster in C, so I want to analyze that together as well. The flame graphs are below.

C pub: [flame graph image]

C sub: [flame graph image]

C++ pub: [flame graph image]

C++ sub: [flame graph image]

What I want to confirm is my suspicion that the reasons for the slow C++ performance are:

  1. Slow serialization on the pub side leads to slow sending.
  2. Slow deserialization on the sub side leads to slow processing.
  3. On the receiving side, the take thread and the deserialization thread are not the same thread; is there interference between the threads?
  4. When take is called, there is a locking mechanism and a limit of max=100 samples.

The above are some of my doubts, and I would like to discuss and study them with you.

wjbbupt avatar Dec 28 '22 08:12 wjbbupt

Hi @wjbbupt ,

Could you give me an example of the code that you used to generate this flamegraph? That way I can run my own analysis and compare our thoughts.

reicheratwork avatar Jan 05 '23 09:01 reicheratwork

@reicheratwork Two versions: one replaces the API calls in the C-language throughput source code with the C++ API and tests that; the other is this version, and the pictures of the test results are above. There is currently a PR outstanding for C++ equivalents of the CycloneDDS-C performance/example programs roundtrip (latency) and throughput: https://github.com/eclipse-cyclonedds/cyclonedds-cxx/pull/325

wjbbupt avatar Jan 10 '23 06:01 wjbbupt

@reicheratwork From the source code analysis, the C++ reader/writer serialization and deserialization and the C-language ones use two separate mechanisms. I am not sure whether there is a historical reason for this; my understanding is that C++ should layer on top of the C API, and I want to unify the two sets of APIs. In that case the performance of C++ would improve a lot. However, I found that there is too much nesting after my modification, so I would like to ask why there were two sets of APIs in the first place: is it because of language differences, or a difference in design concept?

wjbbupt avatar Jan 10 '23 06:01 wjbbupt

@reicheratwork Two versions: one replaces the API calls in the C-language throughput source code with the C++ API and tests that; the other is this version, and the pictures of the test results are above. There is currently a PR outstanding for C++ equivalents of the CycloneDDS-C performance/example programs roundtrip (latency) and throughput: #325

Aha, thanks.

reicheratwork avatar Jan 10 '23 09:01 reicheratwork

@reicheratwork According to the actual test, the serialization time of the C++ version is about 4-5 times that of the C version; this is where the C++ version performs worse than C. The deserialization part has not been tested yet, but from the flame graph it can be seen that deserialization also consumes a lot. I hope we can discuss the reasons for this more; some of the original design intent may not be clear to someone who did not participate in it, but a breakthrough here would give a huge boost to performance. Looking forward to your reply, thanks!

wjbbupt avatar Jan 10 '23 09:01 wjbbupt

@reicheratwork From the source code analysis, the C++ reader/writer serialization and deserialization and the C-language ones use two separate mechanisms. I am not sure whether there is a historical reason for this; my understanding is that C++ should layer on top of the C API, and I want to unify the two sets of APIs. In that case the performance of C++ would improve a lot. However, I found that there is too much nesting after my modification, so I would like to ask why there were two sets of APIs in the first place: is it because of language differences, or a difference in design concept?

There are two things going on here.

  1. the differences between the CycloneDDS C and C++ APIs, where the C++ API is mostly just a thin layer translating C++ function calls to their C equivalents (dds::pub::DataWriter::write to dds_write etc.).
  2. the differences between the C and C++ IDL language bindings, which translate the datatypes to be exchanged from their IDL definitions to their C/C++ equivalents.

I think the main performance penalties on the writing (serialization) side are from differences in implementation caused by the second point.

  • the C representations of types are simple structs, whose members the (de)serializers can access directly
  • the C++ representations are classes whose members can only be accessed through getters/setters

This requires that C and C++ use two different serializers to translate the data from its language-native (C/C++) representation to its wire representation.
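
To make that concrete, here is a rough sketch of how a hypothetical IDL type could end up looking in each binding. This is illustrative only, not the actual idlc/idlcxx output, and the C sequence layout is simplified:

```cpp
// Hypothetical IDL:  struct Msg { long userID; sequence<octet> payload; };
// Illustrative sketch only; not the exact code generated by idlc / idlcxx.

#include <cstdint>
#include <vector>

// C binding (sketch): a plain struct; the serializer reads and writes the
// members directly.
typedef struct Msg_c
{
  int32_t userID;
  struct
  {
    uint32_t _length;
    uint8_t *_buffer;
  } payload;                  /* simplified C sequence layout */
} Msg_c;

// C++ binding (sketch): per the IDL-to-C++ specification the type is a class,
// members are private and reachable only through accessors, and sequences map
// to std::vector.
class Msg
{
private:
  int32_t userID_ = 0;
  std::vector<uint8_t> payload_;

public:
  int32_t userID() const { return userID_; }
  void userID(int32_t v) { userID_ = v; }
  const std::vector<uint8_t> &payload() const { return payload_; }
  std::vector<uint8_t> &payload() { return payload_; }
};
```

A serializer for the C struct can copy straight out of the members, while the C++ serializer has to go through the accessors and the std::vector, which is where part of the extra cost comes from.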

And because the C++ representations of the exchanged types are part of a specification (this one) I think it will be very difficult to adapt the C serializers to be used by the C++ binding.

About things being nested: I don't really understand what you mean by this. Could you please expand on it?

reicheratwork avatar Jan 10 '23 09:01 reicheratwork

@wjbbupt I have been working on some performance improvements on the C++ serialization side, and will have a PR ready for that soon, could I count on you to also poke around and review it?

reicheratwork avatar Jan 10 '23 09:01 reicheratwork

@reicheratwork The tests were run under the same idle conditions. Writer-side differences are mainly in ddsi_serdata_from_sample(), whose role is to implement serialization. The C and C++ implementations of it are different, and the measured performance difference is 4-5 times.

Differences on the read side: the receive function is dds_take() in C and dds_takecdr() in C++.

Therefore, the difference on the pub side is mainly in serialization. According to your conclusion it exists to accommodate the language difference, but if you look at read() and take() on the sub side, you will find that C and C++ also use different sets of interfaces.

wjbbupt avatar Jan 10 '23 10:01 wjbbupt

@wjbbupt I have been working on some performance improvements on the C++ serialization side, and will have a PR ready for that soon, could I count on you to also poke around and review it?

If you can, I can use your PR to run some demo tests. I am very willing to help improve the performance; we have a common goal. Thank you.

wjbbupt avatar Jan 10 '23 10:01 wjbbupt

@wjbbupt : could you give #361 a try and see how it improves the performance?

reicheratwork avatar Jan 10 '23 16:01 reicheratwork

@reicheratwork could you give https://github.com/eclipse-cyclonedds/cyclonedds-cxx/pull/361 a try and see how it improves the performance?

Before the modification: C 7 us, C++ 31 us. After the modification, serialization is about 10 us faster than before. The problems that arise are:

  1. It is not compatible with release 0.10.2; I tried it, but it could not be compiled.
  2. Even with the modification, the efficiency is not as good as that of release 0.10.2.
  3. Also, the release version differs from the master version in many ways.

wjbbupt avatar Jan 11 '23 11:01 wjbbupt

I have a suggestion:

  1. Research shows that the latest release performs better than master; could the optimization be based on the release version?
  2. Serialization is being optimized now. In fact, deserialization on the receiving side with max_sample=100 is also an optimization point; I am thinking of starting there for better results.

wjbbupt avatar Jan 11 '23 11:01 wjbbupt

@reicheratwork I am also integrating the optimized version with the release version. Unfortunately, the compilation problems require too many modifications and have not been resolved yet.

wjbbupt avatar Jan 11 '23 11:01 wjbbupt

@wjbbupt I ran everything through valgrind's callgrind and did a little comparison between the C and C++ performance: callgrind_files.zip (you can view these files in kcachegrind)

I came to the following conclusions regarding the performance (excluding putting the data on the network, setting values in samples by the program before writing, etc.) for the parts that "really differ" between the C and C++ implementations:

C receiving side:

  • ddsi_serdata_from_ser:
    • called for each sample
    • ~10000 instructions/sample
  • ddsi_serdata_to_sample:
    • called for each sample
    • ~1700 instructions/sample

C publishing side:

  • ddsi_serdata_from_sample:
    • called for each sample
    • ~3200 instructions/sample

C++ receiving side:

  • ddsi_serdata_from_ser:
    • called for each sample
    • already converts to user-representation
    • ~26000 instructions/sample
  • ddsi_serdata_to_sample:
    • called only a few times, does not impact performance

C++ publishing side:

  • ddsi_serdata_from_sample:
    • called for each sample
    • ~9300 instructions/sample

The main analysis (per sample):

  • receiving:
    • C:
      • ~8000 instructions are spent copying the network fragments to a contiguous buffer
      • ~750 instructions are spent moving the data into the user-readable format
    • C++:
      • ~8000 instructions are spent copying the network fragments to a contiguous buffer
      • ~15000 instructions are spent translating the data into the user-readable format
        • of which ~8000 are spent initializing the sequence member at a certain size (std::vector::resize)
        • of which ~2000 are spent checking the completeness of the struct (std::set)
  • sending:
    • C:
      • ~2200 instructions are spent serializing the sample
    • C++:
      • ~5000 instructions are spent serializing the sample
        • ~2300 instructions copying the data into the sequence
      • ~2300 instructions are spent copying the sample into the serdata

Main places for improvements in C++ (in my eyes):

  1. remove unnecessary checks for struct completeness in deserializing final types (as this is pointless)
  2. remove default initialization for reading sequences of base types (custom allocator for std::vector?)
  3. simplify checks for whether the buffer has not run out yet (cdr_stream::bytes_available); a rough sketch of this idea follows after this list
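
As a rough illustration of point 3 (this is not the real cdr_stream code; the raw_stream type below is just a stand-in), hoisting the remaining-bytes check out of the per-element loop could look something like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Stand-in for a CDR read buffer; not the real cdr_stream API.
struct raw_stream
{
  const uint8_t *pos;
  const uint8_t *end;
  size_t bytes_available() const { return static_cast<size_t>(end - pos); }
};

// Per-element checking (simplified): one bounds check for every entry read.
void read_u32_seq_per_element(raw_stream &s, uint32_t *out, size_t n)
{
  for (size_t i = 0; i < n; ++i)
  {
    if (s.bytes_available() < sizeof(uint32_t))
      throw std::runtime_error("buffer exhausted");
    std::memcpy(&out[i], s.pos, sizeof(uint32_t));
    s.pos += sizeof(uint32_t);
  }
}

// Hoisted check (simplified): validate once, then copy the whole block.
void read_u32_seq_bulk(raw_stream &s, uint32_t *out, size_t n)
{
  if (s.bytes_available() < n * sizeof(uint32_t))
    throw std::runtime_error("buffer exhausted");
  std::memcpy(out, s.pos, n * sizeof(uint32_t));
  s.pos += n * sizeof(uint32_t);
}
```

The same idea feeds into point 2: once the whole block is known to be in bounds, the destination vector can be filled in one bulk operation instead of element by element.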

@wjbbupt could you have a look at these callgrind files and give some of your insights? Or maybe create your own and compare them to mine?

reicheratwork avatar Jan 12 '23 14:01 reicheratwork

@wjbbupt I made some changes to the deserialization code; this should now copy base types by calling std::vector::assign instead of std::vector::resize and memcpy

Give that a look?

reicheratwork avatar Jan 23 '23 09:01 reicheratwork

@wjbbupt I made some small changes, as the previous improvement neglected to do endianness correction after copying a sequence of base types
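
For reference, this is roughly the shape of that change as I read it; it is illustrative only and ignores alignment, padding and the real cdr_stream interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Before (sketch): resize() default-initializes every element, and memcpy then
// overwrites them, so each element is effectively written twice.
void read_seq_resize(std::vector<uint32_t> &out, const uint8_t *buf, size_t n)
{
  out.resize(n);
  std::memcpy(out.data(), buf, n * sizeof(uint32_t));
}

// Portable 32-bit byte swap used below.
static uint32_t bswap32(uint32_t v)
{
  return (v >> 24) | ((v >> 8) & 0x0000ff00u) | ((v << 8) & 0x00ff0000u) | (v << 24);
}

// After (sketch): assign() constructs the elements straight from the buffer
// range, and the follow-up fix byte-swaps them when the stream endianness
// differs from the host.
void read_seq_assign(std::vector<uint32_t> &out, const uint8_t *buf, size_t n,
                     bool swap_endianness)
{
  const uint32_t *first = reinterpret_cast<const uint32_t *>(buf);
  out.assign(first, first + n);
  if (swap_endianness)
    for (uint32_t &v : out)
      v = bswap32(v);
}
```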

reicheratwork avatar Jan 26 '23 09:01 reicheratwork

@reicheratwork
I am optimizing the C++ performance, and local tests have found some issues. In actual tests the performance improved by about 10%. What I found is this:

  1. The flame graph shows that the finish_member() function frequently inserts data into a set.
  2. From the code generator it turns out that finish_member() only needs to do this when the mode is read; the other modes (move, max, write) do not need it. Therefore, I changed the finish_member handling in the code generator so that it only applies to read mode.
  3. The measured throughput is 380 Mbit/s before the modification and 500 Mbit/s after. Checking the logic in the code, I see no problem. I wanted to submit the modification to the community, but first let me ask whether there is any problem with my change. Looking forward to your reply as soon as possible, thank you.

wjbbupt avatar Feb 08 '23 09:02 wjbbupt

@reicheratwork I am optimizing the C++ performance, and local tests have found some issues. In actual tests the performance improved by about 10%. What I found is this:

  1. The flame graph shows that the finish_member() function frequently inserts data into a set.
  2. From the code generator it turns out that finish_member() only needs to do this when the mode is read; the other modes (move, max, write) do not need it. Therefore, I changed the finish_member handling in the code generator so that it only applies to read mode.
  3. The measured throughput is 380 Mbit/s before the modification and 500 Mbit/s after. Checking the logic in the code, I see no problem. I wanted to submit the modification to the community, but first let me ask whether there is any problem with my change. Looking forward to your reply as soon as possible, thank you.

The insert into the std::set in the finish_member function is done so that the struct's completeness can be checked later, which is necessary when reading an @appendable or @mutable datatype. This check is now only done when it is necessary; see commit de4843d in PR #361. I also made some other performance improvements in the same PR (incremental vs. pre-calculated buffer assignment); you could give those a look as well.
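
To illustrate what that gating amounts to, here is a heavily simplified sketch; the enum values and function names are placeholders, not the real cdr_stream interface:

```cpp
#include <cstdint>
#include <set>

// Placeholder names; the real logic lives in the generated code and the
// cdr_stream classes.
enum class extensibility { ext_final, ext_appendable, ext_mutable };
enum class stream_mode { read, write, move, max };

struct member_tracker
{
  std::set<uint32_t> seen;  // member ids encountered while reading

  // Completeness bookkeeping is only needed when reading an @appendable or
  // @mutable type; for @final types, and for the write/move/max passes, the
  // std::set insert is skipped entirely.
  void finish_member(uint32_t member_id, extensibility ext, stream_mode mode)
  {
    if (mode == stream_mode::read && ext != extensibility::ext_final)
      seen.insert(member_id);
  }

  bool complete(const std::set<uint32_t> &required) const
  {
    for (uint32_t id : required)
      if (seen.count(id) == 0)
        return false;
    return true;
  }
};
```

In this shape a @final type never touches the std::set, which matches the throughput improvement reported above.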

reicheratwork avatar Feb 08 '23 13:02 reicheratwork

@reicheratwork The insert into the std::set in the finish_member function is done so that the struct's completeness can be checked later, which is necessary when reading an @appendable or @mutable datatype. This check is now only done when it is necessary; see commit https://github.com/eclipse-cyclonedds/cyclonedds-cxx/commit/de4843dcc5294fbad5ad50703961ae91f0c5ff95 in PR https://github.com/eclipse-cyclonedds/cyclonedds-cxx/pull/361

I have already studied this, and the test results show that there is still a big performance gap between C++ and C, so there must be room for optimization. I also found that a single serialization calls the write() function in the helloworld.hpp generated from helloworld.idl 5 times. Frequent calls cause many of the functions inside to be executed multiple times, which slows things down. I will sort out this process again; I don't know whether it is supposed to be like this or whether there is a problem with the code architecture. My understanding is that it should be called once.

wjbbupt avatar Feb 09 '23 08:02 wjbbupt

@reicheratwork The insert into the std::set in the finish_member function is done so that the struct's completeness can be checked later, which is necessary when reading an @appendable or @mutable datatype. This check is now only done when it is necessary; see commit de4843d in PR #361

I have already studied this, and the test results show that there is still a big performance gap between C++ and C, so there must be room for optimization. I also found that a single serialization calls the write() function in the helloworld.hpp generated from helloworld.idl 5 times. Frequent calls cause many of the functions inside to be executed multiple times, which slows things down. I will sort out this process again; I don't know whether it is supposed to be like this or whether there is a problem with the code architecture. My understanding is that it should be called once.

You are correct that the differences in performance are disappointing; maybe there are some differences between the C and C++ throughput examples that can explain them. I will examine this.

reicheratwork avatar Feb 17 '23 16:02 reicheratwork