
Dividing an xtensor by a scalar is 20-40× slower than using std::transform

Open · vakokako opened this issue 8 months ago · 11 comments

Dividing an xtensor container by an integer scalar (vOutput = vInput / 2) gives very poor performance compared to a naive implementation with std::transform, while other operations such as multiplication (vOutput = vInput * 2) and maximum (vOutput = xt::maximum(vInput1, vInput2)) give performance similar to std::transform. We build with xsimd enabled.

Comparing the speed of xtensor vs. std::transform for different operations:

  • /2: xtensor is 20-40× slower
  • /2.0: xtensor is 2× slower
  • *2: xtensor is 10% slower
  • max: xtensor is the same speed

Benchmarks:
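
For context, the code below relies on a shape constant and a fill helper that are not shown in the issue. They could look roughly like the following sketch (the 2000×2000 shape is inferred from the benchmark names; both definitions here are assumptions, not the reporter's actual code):

#include <array>
#include <cstddef>
#include <cstdint>
#include <random>
#include <xtensor/xtensor.hpp>

// Assumed shape, inferred from the 2000x2000 in the benchmark names.
static const std::array<std::size_t, 2> cContainerAssignShape = {2000, 2000};

// Assumed helper: fills a tensor with uniform random integers in [0, 100].
template <class TTensor>
void generateRandomInt16From0To100(TTensor& aTensor) {
    std::mt19937 vGenerator{42};  // fixed seed for reproducible benchmarks
    std::uniform_int_distribution<int> vDistribution(0, 100);
    for (auto& vValue : aTensor) {
        vValue = static_cast<uint16_t>(vDistribution(vGenerator));
    }
}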

static void Xtensor_Uint16_2000x2000_DivideBy2_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(), [](auto&& aInputValue) { return aInputValue / 2; });
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        vOutput = vInput / 2;
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(), [](auto&& aInputValue) { return aInputValue / 2.0; });
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        vOutput = vInput / 2.0;
    }
}

static void Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(), [](auto&& aInputValue) { return aInputValue * 2; });
    }
}

static void Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);

    for (auto _ : aState) {
        vOutput = vInput * 2;
    }
}

static void Xtensor_Uint16_2000x2000_Maximum_StdTransform(benchmark::State& aState) {
    auto vInput1 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vInput2 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput1);
    generateRandomInt16From0To100(vInput2);

    for (auto _ : aState) {
        auto vInput2It = vInput2.begin();
        // advance the second input iterator in lockstep while std::transform walks the first
        std::transform(vInput1.begin(), vInput1.end(), vOutput.begin(), [&vInput2It](auto&& aInput1Value) { return std::max(aInput1Value, *vInput2It++); });
    }
}

static void Xtensor_Uint16_2000x2000_Maximum_Xtensor(benchmark::State& aState) {
    auto vInput1 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vInput2 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput1);
    generateRandomInt16From0To100(vInput2);

    for (auto _ : aState) {
        vOutput = xt::maximum(vInput1, vInput2);
    }
}
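
For reference, wiring these functions into Google Benchmark additionally requires registering each one and providing an entry point, along these lines:

BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_Maximum_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_Maximum_Xtensor);

BENCHMARK_MAIN();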

Results on ubuntu:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform                          114483 ns       114483 ns         6016
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                              4295418 ns      4295440 ns          165
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform                    442543 ns       442541 ns         1596
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor                         821435 ns       821429 ns          837
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform                        115849 ns       115845 ns         5901
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                             131595 ns       131594 ns         5328
Xtensor_Uint16_2000x2000_Maximum_StdTransform                            204465 ns       203952 ns         3156
Xtensor_Uint16_2000x2000_Maximum_Xtensor                                 198696 ns       198692 ns         3466

Results on windows:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform                          764377 ns       767299 ns         1120
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                             14637306 ns     14687500 ns           50
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform                   2954759 ns      2966054 ns          289
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor                        5484534 ns      5503462 ns          451
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform                        759787 ns       767299 ns          896
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                             888914 ns       889369 ns          896
Xtensor_Uint16_2000x2000_Maximum_StdTransform                            993174 ns       976562 ns          640
Xtensor_Uint16_2000x2000_Maximum_Xtensor                                 985171 ns      1000977 ns          640

vakokako avatar May 07 '25 07:05 vakokako

Feel free to include these benchmarks in xtensor or xtensor-benchmark as well, in case they are helpful for you.

emmenlau avatar May 07 '25 07:05 emmenlau

@vakokako Have you tried implementing this with xt::noalias? Each time you run this, you could be reallocating memory for the output buffer.
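
For reference, the suggested change would route the assignment through xt::noalias so the expression is evaluated directly into the existing buffer; a minimal sketch of the modified benchmark loop:

    for (auto _ : aState) {
        // xt::noalias skips xtensor's aliasing check, so no temporary is allocated
        xt::noalias(vOutput) = vInput / 2;
    }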

spectre-ns avatar May 29 '25 12:05 spectre-ns

Dear @spectre-ns, thanks a lot for your feedback! But we have implemented a patch in xtensor that checks, for every assignment, whether the containers overlap, and if there is zero overlap defaults to xt::noalias behavior. This has been merged into official xtensor for a while now, so xt::noalias is no longer required in the majority of cases...

emmenlau avatar May 29 '25 19:05 emmenlau

Hi @emmenlau. I'm aware of the patch. I should have been clearer and asked what version of xtensor was used in the test results.

If it's a new version, then you're 100% correct. I'll run the profiler on it at some point as well.

spectre-ns avatar May 29 '25 20:05 spectre-ns

@emmenlau @vakokako I have added the body of the code above to the following branch: https://github.com/spectre-ns/xtensor/tree/revive-benchmarks

There is some variation in the execution times, but all within a factor of 2×. Can you fine folks compile the branch and let me know if you see the issue?

Build configuration: XTENSOR_USE_XSIMD = True, Clang 18.1.8, Release, -ffast-math -std=c++20 -march=native (4 Bytes), -O3

spectre-ns avatar May 30 '25 11:05 spectre-ns

This PR has the benchmarks: https://github.com/xtensor-stack/xtensor/pull/2854

spectre-ns avatar May 30 '25 23:05 spectre-ns

@emmenlau @vakokako I have reviewed the performance issue in more depth. I think the issue here is something called integer promotion. When we perform the operation uint16_t{x} * uint16_t{x} with the std::multiplies<> operator, the result type is int: the operands get promoted to avoid overflow. Thus, when using XSIMD, the resulting type of the xfunction is int. The batch assignment in XSIMD cannot directly assign int to uint16_t without a copy, as it is a narrowing conversion, hence our bad performance. See below for revised benchmarks:


---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform                         1407733 ns        629579 ns         1365
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                              5640281 ns       5000000 ns          100
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform                   2488202 ns       1569475 ns          448
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor                        2310595 ns       1919533 ns          407
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform                       1114117 ns        903320 ns          640
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                            1598084 ns       1416016 ns          640
Xtensor_Uint16_2000x2000_Maximum_StdTransform                            900904 ns        802176 ns          896
Xtensor_Uint16_2000x2000_Maximum_Xtensor                                 905835 ns        836680 ns          747

Generally speaking, mixed-precision arithmetic will be slower than arithmetic that preserves the precision, as it can limit SIMD and force unintended copies. There is still something not right with the xtensor division implementation, though. I will take a look at the code generation there to see why std::transform is so much better. The rest of the benchmarks look close enough to the std implementation that they are not an issue.
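
The promotion itself is ordinary C++ behavior, independent of xtensor, and can be confirmed in isolation (a minimal sketch; the function name is illustrative):

#include <cstdint>
#include <type_traits>

void promotionDemo() {
    uint16_t vLeft = 100;
    uint16_t vRight = 2;
    // Integer promotion: both uint16_t operands are widened to int before the
    // arithmetic, so the result of the expression is an int, not a uint16_t.
    static_assert(std::is_same_v<decltype(vLeft / vRight), int>);
    static_assert(std::is_same_v<decltype(vLeft * 2), int>);
    // Storing the result back into uint16_t is therefore a narrowing conversion.
    uint16_t vQuotient = static_cast<uint16_t>(vLeft / vRight);
    (void)vQuotient;
}

If the goal is to keep the whole pipeline in uint16_t on the xtensor side, one option would be an explicit cast back in the expression (e.g. xt::cast<uint16_t>(vInput / 2)), though the promotion inside the expression itself still happens.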

spectre-ns avatar Jun 11 '25 01:06 spectre-ns

@emmenlau I re-ran Xtensor_Uint16_2000x2000_DivideBy2_Xtensor with xt::noalias to avoid the overlap check. I expected roughly the same results with or without xt::noalias, but as you can see below there is a material difference between the two runs, and we now match std::transform in the final outstanding case. I'm at a complete loss as to why there is a performance difference... Maybe you have some insight!


---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform                         1977122 ns        656250 ns         1000
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                              1763490 ns       1045850 ns          747
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform                   2589741 ns       1902174 ns          345
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor                        2713626 ns       2267020 ns          448
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform                       1420396 ns       1171875 ns         1120
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                            1879888 ns       1177764 ns          597
Xtensor_Uint16_2000x2000_Maximum_StdTransform                           1325627 ns        564575 ns         1024
Xtensor_Uint16_2000x2000_Maximum_Xtensor                                1348026 ns        609375 ns         1000

spectre-ns avatar Jun 12 '25 23:06 spectre-ns

Wow, so adding xt::noalias() fixes the performance regression? That is interesting, and the updated timings are impressive! We may be able to take a look at this some time soon.

emmenlau avatar Jul 27 '25 19:07 emmenlau

> Wow, so adding xt::noalias() fixes the performance regression? That is interesting, and the updated timings are impressive! We may be able to take a look at this some time soon.

@emmenlau I did the compilation with the Intel compiler, so it could be down to code generation rather than the actual C++ code. xt::noalias() appears to give the compiler what it wants. Please see if you can recreate what I'm seeing. I added these benchmarks to the CI, so it should be easy.

spectre-ns avatar Jul 27 '25 23:07 spectre-ns

Hi @spectre-ns, we actually don't observe a speedup from adding xt::noalias to Xtensor_Uint16_2000x2000_DivideBy2_Xtensor on Ubuntu, compiled with Clang 20.1.8.

vakokako avatar Sep 04 '25 17:09 vakokako