
The time-consuming problem of loan

byrnexu opened this issue on Aug 24, 2022 · 9 comments

Operating system: Ubuntu 20.04.4 LTS

Compiler version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

Observed result or behaviour: I found that the loan call below takes about 3-4 microseconds (Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz). Since iceoryx's code is quite complex for me, I have not studied it carefully, but I think this should just copy some data into shared memory, and publishing should not take a lot of time either. Is there any room for optimization here? Of course, it may also be a problem with my test. If this causes you trouble, I apologize.

publisher->loan(len)
       .and_then([&](auto& userPayload) {
         memset(userPayload, 0, len);
         memcpy(userPayload, data, len);
         publisher->publish(userPayload);
       })  
       .or_else([this](auto& error) {
         std::ostringstream oss;
         oss << error;
         LOG_E("Unable to loan shm. {} [{}{}{}{}{}] [{}]", appName_, service_,
               SEP_OF_SHM_SVC, instance_, SEP_OF_SHM_SVC, event_, oss.str());
       }); 

Expected result or behaviour: The loan process should be faster.


byrnexu avatar Aug 24 '22 01:08 byrnexu

@byrnexu

  1. One problem is that you write to the same memory region twice, which doubles the time spent writing. It is safe to remove memset(userPayload, 0, len); since the memcpy in the next line overwrites the data anyway.
  2. You can further improve the performance by writing the data directly into the userPayload. I assume you have some code like this:
Data produceData();

auto data = produceData();
publisher->loan(len)
       .and_then([&](auto& userPayload) {
         memcpy(userPayload, &data, len);
         publisher->publish(userPayload);
       })  

But it would be much faster if you could get rid of this additional memcpy as well and change the implementation to something like:

void produceData(Data * data);

publisher->loan(len)
       .and_then([&](auto& userPayload) {
         produceData(static_cast<Data*>(userPayload));
         publisher->publish(userPayload);
       })  
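If the payload type is known at compile time, the typed publisher API lets you construct the data directly in the loaned sample and avoids the void* handling altogether. A rough sketch along the lines of the icedelivery example; the Data struct, the service description and the runtime name are placeholders:

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

#include <cstdint>
#include <iostream>

// Placeholder payload type; replace with your actual data struct.
struct Data
{
    double x{0.0};
    double y{0.0};
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("typed-publisher-sketch");
    iox::popo::Publisher<Data> publisher({"MyService", "MyInstance", "MyEvent"});

    publisher.loan()
        .and_then([](auto& sample) {
            // the payload is constructed directly in shared memory, no memset/memcpy
            sample->x = 1.0;
            sample->y = 2.0;
            sample.publish();
        })
        .or_else([](auto& error) {
            std::cerr << "Unable to loan sample, error code: " << static_cast<uint64_t>(error)
                      << std::endl;
        });

    return 0;
}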

In my opinion, and at first glance, the bottleneck is not loan or publish but the two additional, unnecessary copies with memset and memcpy, which are extremely time consuming.

You can measure the single steps in detail if you adjust your code like this:

std::chrono::steady_clock::time_point beforeMem, afterMem, afterPublish;
auto start = std::chrono::steady_clock::now();
publisher->loan(len)
    .and_then([&](auto& userPayload) {
        beforeMem = std::chrono::steady_clock::now();
        memset(userPayload, 0, len);
        memcpy(userPayload, data, len);
        afterMem = std::chrono::steady_clock::now();
        publisher->publish(userPayload);
        afterPublish = std::chrono::steady_clock::now();
    })
    .or_else([this](auto& error) {
        std::ostringstream oss;
        oss << error;
        LOG_E("Unable to loan shm. {} [{}{}{}{}{}] [{}]", appName_, service_,
              SEP_OF_SHM_SVC, instance_, SEP_OF_SHM_SVC, event_, oss.str());
    });

std::cout << "Time to loan: "
          << std::chrono::duration_cast<std::chrono::nanoseconds>(beforeMem - start).count()
          << "ns" << std::endl;
std::cout << "Time to write memory: "
          << std::chrono::duration_cast<std::chrono::nanoseconds>(afterMem - beforeMem).count()
          << "ns " << std::endl;
std::cout << "Time to publish: "
          << std::chrono::duration_cast<std::chrono::nanoseconds>(afterPublish - afterMem).count()
          << "ns " << std::endl;

If you do this measurement I would be curious to see the real numbers. If this solves your issue we should add this to our FAQ as a hint.

elfenpiff avatar Aug 24 '22 10:08 elfenpiff

The machine and code I used for the test were in the cloud, and unfortunately that machine was released yesterday. I have now tested a similar piece of code on my home PC, which does not have an Intel processor but an AMD Ryzen 9 5950X 16-Core Processor, and I found that the performance drops greatly; this drop is partly related to the processor. From the measurement code you provided, writing the memory does not take much time. With the Intel processor the whole process takes about 3-4 microseconds; with the AMD processor it takes more than 10 microseconds. Either way, this shows that the loan+publish path takes several microseconds. I would like to know whether this is within the normal range and whether there is any room for improvement.

   std::chrono::steady_clock::time_point beforeMem, afterMem, afterPublish;
   auto start = std::chrono::steady_clock::now();
 
   publisher_->loan(shmBufLen)
       .and_then([&](auto& userPayload) {
         beforeMem = std::chrono::steady_clock::now();
 
         fillSHMBufCallback(userPayload);
 
         afterMem = std::chrono::steady_clock::now();
         publisher_->publish(userPayload);
         afterPublish = std::chrono::steady_clock::now();
       })
       .or_else([&](auto& error) {
         std::ostringstream oss;
         oss << error;
         LOG_E("Unable to loan shm. {} [{}] [{}]", appName_, publisherName_,
               oss.str());
       }); 
 
   auto timeToLoan =
       std::chrono::duration_cast<std::chrono::nanoseconds>(beforeMem - start)
           .count();
 
   auto timeToWriteMem =
       std::chrono::duration_cast<std::chrono::nanoseconds>(afterMem - beforeMem)
           .count();
 
   auto timeToPub = std::chrono::duration_cast<std::chrono::nanoseconds>(
                        afterPublish - afterMem)
                        .count();
 
   std::cout << "Time to loan: " << timeToLoan << "ns" << std::endl;
   std::cout << "Time to write memory: " << timeToWriteMem << "ns " << std::endl;
   std::cout << "Time to publish: " << timeToPub << "ns " << std::endl;

To reduce the impact of std::cout on the test results, I computed the durations first and printed them afterwards. Here are the results of 30 iterations:

Time to loan: 5350ns Time to write memory: 80ns Time to publish: 13330ns
Time to loan: 3270ns Time to write memory: 60ns Time to publish: 11090ns
Time to loan: 4480ns Time to write memory: 60ns Time to publish: 11150ns
Time to loan: 1310ns Time to write memory: 60ns Time to publish: 11320ns
Time to loan: 1390ns Time to write memory: 70ns Time to publish: 11390ns
Time to loan: 1280ns Time to write memory: 70ns Time to publish: 11340ns
Time to loan: 1290ns Time to write memory: 70ns Time to publish: 11630ns
Time to loan: 1330ns Time to write memory: 60ns Time to publish: 11850ns
Time to loan: 2060ns Time to write memory: 70ns Time to publish: 14440ns
Time to loan: 1360ns Time to write memory: 70ns Time to publish: 11510ns
Time to loan: 1600ns Time to write memory: 70ns Time to publish: 13360ns
Time to loan: 1310ns Time to write memory: 70ns Time to publish: 11790ns
Time to loan: 4270ns Time to write memory: 80ns Time to publish: 11200ns
Time to loan: 1050ns Time to write memory: 60ns Time to publish: 10900ns
Time to loan: 1081ns Time to write memory: 70ns Time to publish: 13900ns
Time to loan: 1000ns Time to write memory: 70ns Time to publish: 11830ns
Time to loan: 920ns Time to write memory: 70ns Time to publish: 11690ns
Time to loan: 890ns Time to write memory: 70ns Time to publish: 11390ns
Time to loan: 1190ns Time to write memory: 80ns Time to publish: 12950ns
Time to loan: 970ns Time to write memory: 70ns Time to publish: 11610ns
Time to loan: 840ns Time to write memory: 70ns Time to publish: 11370ns
Time to loan: 960ns Time to write memory: 70ns Time to publish: 11560ns
Time to loan: 1430ns Time to write memory: 70ns Time to publish: 15040ns
Time to loan: 1050ns Time to write memory: 60ns Time to publish: 11880ns
Time to loan: 1400ns Time to write memory: 50ns Time to publish: 12560ns
Time to loan: 1150ns Time to write memory: 60ns Time to publish: 11960ns
Time to loan: 1760ns Time to write memory: 70ns Time to publish: 14310ns
Time to loan: 1300ns Time to write memory: 70ns Time to publish: 11410ns
Time to loan: 1140ns Time to write memory: 60ns Time to publish: 12401ns
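For reference, a minimal sketch of that "store the durations first, print after the loop" pattern; the timestamps inside the loop are placeholders for the actual loan, payload write and publish calls:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::size_t RUNS = 30U;

    struct Timing
    {
        std::int64_t loanNs;
        std::int64_t writeNs;
        std::int64_t publishNs;
    };
    std::vector<Timing> timings;
    timings.reserve(RUNS);

    for (std::size_t i = 0U; i < RUNS; ++i)
    {
        // Placeholders: take the timestamps around loan, writing the payload and publish.
        auto start = std::chrono::steady_clock::now();
        auto beforeMem = std::chrono::steady_clock::now();
        auto afterMem = std::chrono::steady_clock::now();
        auto afterPublish = std::chrono::steady_clock::now();

        using std::chrono::duration_cast;
        using std::chrono::nanoseconds;
        timings.push_back({duration_cast<nanoseconds>(beforeMem - start).count(),
                           duration_cast<nanoseconds>(afterMem - beforeMem).count(),
                           duration_cast<nanoseconds>(afterPublish - afterMem).count()});
    }

    // print only after all measurements are done so std::cout does not distort them
    for (const auto& t : timings)
    {
        std::cout << "Time to loan: " << t.loanNs << "ns Time to write memory: " << t.writeNs
                  << "ns Time to publish: " << t.publishNs << "ns" << std::endl;
    }
}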

byrnexu avatar Aug 25 '22 08:08 byrnexu

@byrnexu Those loan and publish timings look a bit high. To get a better picture, could you please answer the following:

  1. How many processes (threads) do you have running in your system concurrently?
  2. How many subscribers does the publisher have?
  3. Are those subscribers attached to a Waitset or a Listener?

elfenpiff avatar Aug 25 '22 10:08 elfenpiff

@byrnexu Could you please run the iceperf example? It is a benchmark that tests iceoryx, the iceoryx C API, message queues and unix domain sockets. I ran it on my system and got the following numbers for iceoryx.

******      ICEORYX       ********
Waiting for: subscription, subscriber [ success ]
Measurement for: 1 kB, 2 kB, 4 kB, 8 kB, 16 kB, 32 kB, 64 kB, 128 kB, 256 kB, 512 kB, 1024 kB, 2048 kB, 4096 kB
Waiting for: unsubscribe  [ finished ]

#### Measurement Result ####
10000 round trips for each payload.

| Payload Size [kB] | Average Latency [µs] |
|------------------:|---------------------:|
|                 1 |                  1.2 |
|                 2 |                  1.9 |
|                 4 |                  1.9 |
|                 8 |                  1.9 |
|                16 |                    2 |
|                32 |                  1.9 |
|                64 |                  1.9 |
|               128 |                  1.9 |
|               256 |                  1.9 |
|               512 |                  1.9 |
|              1024 |                  1.9 |
|              2048 |                  1.9 |
|              4096 |                  1.9 |

Finished!

With a latency of around 1.9 microseconds for a 4 kB sample, I would expect that loan and publish in your case above should not take more than 1.9 microseconds in sum. Otherwise something is off: either we do not cover this API / use case in the iceperf benchmark, or something else in your system is stealing some microseconds from you.

Either way it is a good idea to investigate this further.

elfenpiff avatar Aug 25 '22 11:08 elfenpiff

Thank you very much for your kind reply.

  1. How many processes (threads) do you have running in your system concurrently? 1 process.
  2. How many subscribers does the publisher have? 1 subscriber
  3. Are those subscribers attached to a Waitset or a Listener? attached to a Waitset

Below is the result of running iceperf; it doesn't look slow.

******      ICEORYX       ********
Waiting for: subscription, subscriber [ success ]
Measurement for: 1 kB, 2 kB, 4 kB, 8 kB, 16 kB, 32 kB, 64 kB, 128 kB, 256 kB, 512 kB, 1024 kB, 2048 kB, 4096 kB
Waiting for: unsubscribe  [ finished ]

#### Measurement Result ####
10000 round trips for each payload.

| Payload Size [kB] | Average Latency [µs] |
|------------------:|---------------------:|
|                 1 |                  1.1 |
|                 2 |                  1.1 |
|                 4 |                  0.9 |
|                 8 |                 0.91 |
|                16 |                  0.9 |
|                32 |                 0.88 |
|                64 |                 0.89 |
|               128 |                 0.89 |
|               256 |                  1.1 |
|               512 |                 0.91 |
|              1024 |                  1.3 |
|              2048 |                  1.1 |
|              4096 |                  0.9 |

Finished!

byrnexu avatar Aug 25 '22 14:08 byrnexu

> Thank you very much for your kind reply.

@byrnexu Thank you for discovering some interesting issues :smile:

I suspect that the WaitSet may be the cause of the increased loan and publish runtime.

Could you do a test run with the subscriber not attached to the WaitSet and send me the numbers for Time to loan, Time to write memory and Time to publish again?

In the end the publisher performs a post call on a POSIX semaphore to inform the subscriber that a new sample has arrived, and this may involve an expensive OS context switch. We experienced something similar with our keepalive heartbeat; maybe we are seeing it here again.
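For such a test run, a polling subscriber that is not attached to a WaitSet or Listener could look roughly like the icedelivery example; the Data struct, the service description and the polling interval are placeholders:

#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

#include <chrono>
#include <iostream>
#include <thread>

// Placeholder payload type; must match the publisher side.
struct Data
{
    double x{0.0};
    double y{0.0};
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("polling-subscriber-sketch");
    iox::popo::Subscriber<Data> subscriber({"MyService", "MyInstance", "MyEvent"});

    while (true)
    {
        // poll for new samples; the subscriber is not attached to a WaitSet or Listener
        subscriber.take().and_then([](auto& sample) {
            std::cout << "received: " << sample->x << ", " << sample->y << std::endl;
        });
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}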

elfenpiff avatar Aug 25 '22 14:08 elfenpiff

I shut down the subscriber side and got the following result:

Time to loan: 1980ns Time to write memory: 70ns Time to publish: 14800ns
Time to loan: 1570ns Time to write memory: 70ns Time to publish: 12890ns
Time to loan: 820ns Time to write memory: 50ns Time to publish: 11410ns
Time to loan: 1810ns Time to write memory: 60ns Time to publish: 14920ns

The results seem to be about the same, so I suspect it has something to do with the processor or the environment of the test machine. My processor is an AMD Ryzen 9 5950X 16-Core Processor; on another machine with an Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz the test results are much better:

Time to loan: 522ns Time to write memory: 41ns Time to publish: 2273ns
Time to loan: 701ns Time to write memory: 42ns Time to publish: 2108ns
Time to loan: 575ns Time to write memory: 41ns Time to publish: 2165ns
Time to loan: 582ns Time to write memory: 42ns Time to publish: 2173ns
Time to loan: 621ns Time to write memory: 46ns Time to publish: 2272ns

byrnexu avatar Sep 09 '22 10:09 byrnexu

@byrnexu I guess I am late here, but what I see is that there is (roughly) at most a factor of 10 difference in publish time between the different platforms. The values are still in the low microseconds, which is acceptable and in line with other measurements (if it jumped to hundreds of microseconds or milliseconds it would be more suspicious).

Now I can only guess the reason, but the queues used internally (in publish) rely on atomic operations and, in certain cases, potentially memory fences. These can vary quite a bit between architectures, which could explain the differences (that would be my first guess), especially compare-exchange operations. I once found a better benchmark post, but https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html also gives some impression of the order of magnitude of certain atomic operations.
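This is not what iceoryx does internally, but to get a rough feeling for the per-operation cost of an uncontended compare-exchange on a given machine, a tiny stand-alone loop like the following can help (the iteration count is arbitrary):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

int main()
{
    constexpr std::uint64_t ITERATIONS = 10'000'000U;
    std::atomic<std::uint64_t> value{0U};

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0U; i < ITERATIONS; ++i)
    {
        // uncontended compare-exchange; contended behavior is considerably more expensive
        std::uint64_t expected = value.load(std::memory_order_relaxed);
        value.compare_exchange_strong(expected, expected + 1U);
    }
    auto stop = std::chrono::steady_clock::now();

    auto totalNs = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::cout << "average ns per uncontended CAS: "
              << static_cast<double>(totalNs) / static_cast<double>(ITERATIONS) << std::endl;
}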

To get better data (if needed), the first step would be to estimate the mean (and maybe the variance) from a sufficient number of publish operations (>10^6, I would say).
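A minimal sketch of such a measurement; publishOnce() is a placeholder for one loan + write + publish cycle on your publisher:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Placeholder for one loan + write + publish cycle on your publisher.
void publishOnce()
{
    // e.g. publisher_->loan(shmBufLen).and_then(...).or_else(...);
}

int main()
{
    constexpr std::size_t SAMPLES = 1'000'000U;
    std::vector<double> latenciesNs;
    latenciesNs.reserve(SAMPLES);

    for (std::size_t i = 0U; i < SAMPLES; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        publishOnce();
        auto stop = std::chrono::steady_clock::now();
        latenciesNs.push_back(static_cast<double>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()));
    }

    // mean and variance over all samples
    double mean = 0.0;
    for (auto ns : latenciesNs) { mean += ns; }
    mean /= static_cast<double>(latenciesNs.size());

    double variance = 0.0;
    for (auto ns : latenciesNs) { variance += (ns - mean) * (ns - mean); }
    variance /= static_cast<double>(latenciesNs.size());

    std::sort(latenciesNs.begin(), latenciesNs.end());

    std::cout << "mean: " << mean << " ns, stddev: " << std::sqrt(variance)
              << " ns, median: " << latenciesNs[latenciesNs.size() / 2U] << " ns" << std::endl;
}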

From the data I see I am not too concerned about the differences between platforms, but I could have missed something.

@elBoberido FYI

MatthiasKillat avatar Sep 15 '22 16:09 MatthiasKillat

@MatthiasKillat thanks a lot for your answer.

byrnexu avatar Sep 21 '22 01:09 byrnexu

@byrnexu Closing this issue. Feel free to re-open if you have further questions.

mossmaurice avatar Oct 06 '22 16:10 mossmaurice