Dividing an xtensor container by an integer scalar is 20-40x slower than using std::transform
Dividing an xtensor container by an integer scalar (vOutput = vInput / 2) gives very poor performance compared to a naive implementation with std::transform, while other operations such as multiplication (vOutput = vInput * 2) and maximum (vOutput = xt::maximum(vInput1, vInput2)) give performance similar to std::transform. We build with xsimd enabled.
Comparing the speed of xtensor vs std::transform for different operations:
/2: xtensor is 20-40x slower
/2.0: xtensor is 2x slower
*2: xtensor is 10% slower
max: xtensor is the same speed
Benchmarks:
static void Xtensor_Uint16_2000x2000_DivideBy2_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(),
                       [](auto&& aInputValue) { return aInputValue / 2; });
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        vOutput = vInput / 2;
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(),
                       [](auto&& aInputValue) { return aInputValue / 2.0; });
    }
}

static void Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        vOutput = vInput / 2.0;
    }
}

static void Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        std::transform(vInput.begin(), vInput.end(), vOutput.begin(),
                       [](auto&& aInputValue) { return aInputValue * 2; });
    }
}

static void Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        vOutput = vInput * 2;
    }
}

static void Xtensor_Uint16_2000x2000_Maximum_StdTransform(benchmark::State& aState) {
    auto vInput1 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vInput2 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput1);
    generateRandomInt16From0To100(vInput2);
    for (auto _ : aState) {
        auto vInput2It = vInput2.begin();
        std::transform(vInput1.begin(), vInput1.end(), vOutput.begin(),
                       [&vInput2It](auto&& aInput1Value) { return std::max(aInput1Value, *vInput2It++); });
    }
}

static void Xtensor_Uint16_2000x2000_Maximum_Xtensor(benchmark::State& aState) {
    auto vInput1 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vInput2 = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput1);
    generateRandomInt16From0To100(vInput2);
    for (auto _ : aState) {
        vOutput = xt::maximum(vInput1, vInput2);
    }
}
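For completeness, here is a sketch of the scaffolding the benchmarks assume; cContainerAssignShape and generateRandomInt16From0To100 are our own helpers, so their exact definitions may differ slightly from this sketch.

// Sketch of the assumed scaffolding; in the actual file these definitions
// precede the benchmark functions above.
#include <algorithm>
#include <cstdint>
#include <random>
#include <benchmark/benchmark.h>
#include <xtensor/xtensor.hpp>
#include <xtensor/xmath.hpp>

// Shape constant implied by the 2000x2000 benchmark names.
static const xt::xtensor<uint16_t, 2>::shape_type cContainerAssignShape = {2000, 2000};

// Assumed fill helper: uniform random values in [0, 100].
template <class T>
void generateRandomInt16From0To100(T& aContainer) {
    std::mt19937 vGenerator{42};
    std::uniform_int_distribution<int> vDistribution{0, 100};
    std::generate(aContainer.begin(), aContainer.end(),
                  [&]() { return static_cast<uint16_t>(vDistribution(vGenerator)); });
}

BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor);
BENCHMARK(Xtensor_Uint16_2000x2000_Maximum_StdTransform);
BENCHMARK(Xtensor_Uint16_2000x2000_Maximum_Xtensor);
BENCHMARK_MAIN();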
Results on ubuntu:
---------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform              114483 ns       114483 ns         6016
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                  4295418 ns      4295440 ns          165
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform        442543 ns       442541 ns         1596
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor             821435 ns       821429 ns          837
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform            115849 ns       115845 ns         5901
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                 131595 ns       131594 ns         5328
Xtensor_Uint16_2000x2000_Maximum_StdTransform                204465 ns       203952 ns         3156
Xtensor_Uint16_2000x2000_Maximum_Xtensor                     198696 ns       198692 ns         3466
Results on windows:
---------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform              764377 ns       767299 ns         1120
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                 14637306 ns     14687500 ns           50
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform       2954759 ns      2966054 ns          289
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor            5484534 ns      5503462 ns          451
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform            759787 ns       767299 ns          896
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                 888914 ns       889369 ns          896
Xtensor_Uint16_2000x2000_Maximum_StdTransform                993174 ns       976562 ns          640
Xtensor_Uint16_2000x2000_Maximum_Xtensor                     985171 ns      1000977 ns          640
Feel free to include these benchmarks in xtensor or xtensor-benchmark as well, in case they are helpful for you.
@vakokako Have you tried implementing this with xt::noalias? Each time you run this you could be reallocating memory for the output buffer.
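For example, something along these lines (a minimal sketch using the variable names from the benchmarks above):

    // Plain assignment: the library may have to assume vOutput aliases vInput.
    vOutput = vInput / 2;

    // With xt::noalias (from xtensor/xnoalias.hpp): promise there is no aliasing,
    // so the expression can be evaluated directly into vOutput's existing buffer.
    xt::noalias(vOutput) = vInput / 2;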
Dear @spectre-ns, thanks a lot for your feedback! But we have implemented a patch in xtensor that checks on every assignment whether the containers overlap, and if there is no overlap, defaults to xt::noalias. This has been merged into official xtensor for a while now, so xt::noalias is no longer required in the majority of cases...
Hi @emmenlau. I'm aware of the patch. I should have been more clear and asked what version of xtensor was used in the test results.
If it's a new version then you're 100% correct. I'll run the profiler on it at some point as well.
@emmenlau @vakokako I have added the body of the code above to the following branch: https://github.com/spectre-ns/xtensor/tree/revive-benchmarks
There is some variation between the execution times, but everything is within a factor of 2x. Can you fine folks compile the branch and let me know if you see the issue?
XTENSOR_USE_XSIMD = True, Clang 18.1.8, Release, -ffast-math -std=c++20 -march=native (4 Bytes) -O3
This PR has the benchmarks: https://github.com/xtensor-stack/xtensor/pull/2854
@emmenlau @vakokako I have reviewed the performance issue in more depth. I think the issue here is something called integer promotion. When we perform the operation uint16_t{x} * uint16_t{x} with the std::multiplies<> operator, the result type is int: the operands get promoted to avoid overflow. Thus, when using XSIMD, the resulting type of the xfunction is int. The batch assignment in XSIMD cannot directly assign int to uint16_t without a copy, as it is a narrowing conversion, hence our bad performance. See below for revised benchmarks:
Benchmark                                                      Time             CPU   Iterations
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform             1407733 ns       629579 ns         1365
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                  5640281 ns      5000000 ns          100
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform       2488202 ns      1569475 ns          448
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor            2310595 ns      1919533 ns          407
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform           1114117 ns       903320 ns          640
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                1598084 ns      1416016 ns          640
Xtensor_Uint16_2000x2000_Maximum_StdTransform                900904 ns       802176 ns          896
Xtensor_Uint16_2000x2000_Maximum_Xtensor                     905835 ns       836680 ns          747
Generally speaking mixed precision arithmetic will be slower than if you preserve the precision as it can limit SIMD and force unintended copies. There is something not right with the xtensor division implementation though. I will take a look at the code generation there to see how std::transform is so much better. The rest of the benchmarks look to be close enough to the std implementation that it's not an issue.
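To make the promotion concrete, here is a minimal sketch; the static_asserts are standard C++ behavior and not xtensor-specific, and the commented xt::cast line is a hypothetical workaround for staying in uint16_t lanes, not something the benchmarks above use.

#include <cstdint>
#include <type_traits>

// Integral operands narrower than int are promoted to int before arithmetic,
// so uint16_t expressions yield int results.
static_assert(std::is_same_v<decltype(std::uint16_t{10} / 2), int>);
static_assert(std::is_same_v<decltype(std::uint16_t{10} * std::uint16_t{2}), int>);

// Hypothetical workaround sketch: cast the expression back to uint16_t so the
// SIMD assignment does not have to narrow int batches element by element.
// xt::noalias(vOutput) = xt::cast<std::uint16_t>(vInput / 2);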
@emmenlau I re-ran Xtensor_Uint16_2000x2000_DivideBy2_Xtensor with xt::noalias to avoid the overlap check. I expected roughly the same results with or without xt::noalias, but as you can see below there is a material difference between the two runs, and we now match std::transform in the final outstanding case (a sketch of the modified benchmark body follows the table). I'm at a complete loss as to why there is a performance difference... Maybe you have some insight!
Benchmark                                                      Time             CPU   Iterations
Xtensor_Uint16_2000x2000_DivideBy2_StdTransform             1977122 ns       656250 ns         1000
Xtensor_Uint16_2000x2000_DivideBy2_Xtensor                  1763490 ns      1045850 ns          747
Xtensor_Uint16_2000x2000_DivideBy2Double_StdTransform       2589741 ns      1902174 ns          345
Xtensor_Uint16_2000x2000_DivideBy2Double_Xtensor            2713626 ns      2267020 ns          448
Xtensor_Uint16_2000x2000_MultiplyBy2_StdTransform           1420396 ns      1171875 ns         1120
Xtensor_Uint16_2000x2000_MultiplyBy2_Xtensor                1879888 ns      1177764 ns          597
Xtensor_Uint16_2000x2000_Maximum_StdTransform               1325627 ns       564575 ns         1024
Xtensor_Uint16_2000x2000_Maximum_Xtensor                    1348026 ns       609375 ns         1000
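For reference, the modified benchmark body looked roughly like this; a sketch where only the assignment line changes from the version at the top of the issue, and the _Noalias suffix is mine for illustration:

static void Xtensor_Uint16_2000x2000_DivideBy2_Xtensor_Noalias(benchmark::State& aState) {
    auto vInput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    auto vOutput = xt::xtensor<uint16_t, 2>::from_shape(cContainerAssignShape);
    generateRandomInt16From0To100(vInput);
    for (auto _ : aState) {
        // xt::noalias skips the runtime overlap check and any temporary,
        // evaluating the expression directly into vOutput's buffer.
        xt::noalias(vOutput) = vInput / 2;
    }
}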
Wow, so adding xt::noalias() fixes the performance regression? That is interesting, and the updated timings are impressive! We may be able to take a look at this some time soon.
@emmenlau I did the compilation with the Intel compiler, so it could be down to code generation rather than the actual C++ code. xt::noalias() appears to give the compiler what it wants. Please see if you can re-create what I'm seeing. I added these benchmarks to the CI, so it should be easy.
Hi @spectre-ns, we actually don't observe a speedup from adding xt::noalias to Xtensor_Uint16_2000x2000_DivideBy2_Xtensor on Ubuntu, compiled with Clang 20.1.8.