xsimd
Complex calculation with AVX2 is 3 times slower than normal
I used your example and only modified the operation, with std::vector<double, xsimd::aligned_allocator<double>>.
xsimd::sqrt((xsimd::cos(ba) + xsimd::sin(bb)) / 2) takes 12 s, which is slower than std::sqrt((std::cos(a[i]) + std::sin(b[i])) / 2) at 4.4 s.
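For reference, a minimal sketch of what the modified loop presumably looks like (the names ba, bb, run_xsimd and the aligned loads/stores are my assumptions; only the aligned_allocator and the xsimd expression come from the report above):

#include <vector>
#include <xsimd/xsimd.hpp>

using batch_t = xsimd::batch<double, xsimd::avx2>;
using aligned_vector = std::vector<double, xsimd::aligned_allocator<double>>;

// Vectorized form of sqrt((cos(a) + sin(b)) / 2) over aligned vectors.
void run_xsimd(const aligned_vector& a, const aligned_vector& b, aligned_vector& res)
{
    for (std::size_t i = 0; i < a.size(); i += batch_t::size) {
        auto ba = batch_t::load_aligned(&a[i]); // aligned loads are valid thanks to the allocator
        auto bb = batch_t::load_aligned(&b[i]);
        auto r = xsimd::sqrt((xsimd::cos(ba) + xsimd::sin(bb)) / 2.);
        r.store_aligned(&res[i]);
    }
}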
Sorry, I cannot reproduce your timings. Here is the test program I've been using
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <xsimd/xsimd.hpp>

int main(int argc, char** argv)
{
    unsigned n = std::atoi(argv[1]);
    unsigned p = std::atoi(argv[2]);
    std::vector<double> x(n);
    std::vector<double> y(n);
    std::vector<double> out(n);
    for(unsigned i = 0; i < n; ++i) {
        x[i] = .00002 * i;
        y[i] = .00003 * i;
    }
    for(unsigned j = 0; j < p; ++j) {
#ifdef USE_XSIMD
        // vectorized path: one AVX2 batch of doubles per iteration
        for(unsigned i = 0; i < n; i += xsimd::batch<double, xsimd::avx2>::size) {
            auto vout = xsimd::load_unaligned(&out[i]);
            auto vx = xsimd::load_unaligned(&x[i]);
            auto vy = xsimd::load_unaligned(&y[i]);
            vout += xsimd::sqrt((xsimd::cos(vx) + xsimd::sin(vy)) / 2.);
            vout.store_unaligned(&out[i]);
        }
#else
        // scalar baseline
        for(unsigned i = 0; i < n; ++i) {
            out[i] += std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2);
        }
#endif
    }
    std::cout << out[(n / p) % n] << "\n"; // index clamped into range: out[n / p] reads past the end when p == 1
    return 0;
}
compiled with g++ test.cpp -O2 -mavx2 -o r -UUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100
or g++ test.cpp -O2 -mavx2 -o r -DUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100
I get a consistent 2.5x speedup with xsimd on... same for -O3 and with clang.
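One detail worth noting about the test above: the vectorized loop assumes n is a multiple of the batch size (true for the 1e6 and 1e8 cases discussed here). For arbitrary n, a scalar tail would be needed; a sketch under the same assumptions as the program above:

unsigned simd_size = xsimd::batch<double, xsimd::avx2>::size;
unsigned vec_end = n - n % simd_size;          // last index covered by full batches
for (unsigned i = 0; i < vec_end; i += simd_size) {
    auto vout = xsimd::load_unaligned(&out[i]);
    auto vx = xsimd::load_unaligned(&x[i]);
    auto vy = xsimd::load_unaligned(&y[i]);
    vout += xsimd::sqrt((xsimd::cos(vx) + xsimd::sin(vy)) / 2.);
    vout.store_unaligned(&out[i]);
}
for (unsigned i = vec_end; i < n; ++i) {       // remaining elements, scalar fallback
    out[i] += std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2);
}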
I used your code and got the same result, even when I changed to n=1e8, p=1. But when I change the data to my input:
for(unsigned i = 0; i < n; ++i) {
    //x[i] = .00002 * i;
    //y[i] = .00003 * i;
    x[i] = i;
    y[i] = std::sin(i);
}
the times are very different: with n=1e8, p=1 it is 12 s (xsimd) vs 4 s (normal), while with n=1e6, p=1e2 it is 1.1 s (with xsimd AVX2) vs 2.4 s (normal). I think there may be two reasons: 1. the input data, 2. the data size could affect the time cost.
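To separate the two suspected causes, one option (my own sketch, not part of the programs in this thread) is to time the same xsimd kernel twice for the same n, once with the original small-magnitude inputs and once with the reported large-magnitude inputs, so that only the data changes:

#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>
#include <xsimd/xsimd.hpp>

using batch_t = xsimd::batch<double, xsimd::avx2>;

// Same kernel as the test program above, applied once over x and y.
void kernel(const std::vector<double>& x, const std::vector<double>& y, std::vector<double>& out)
{
    for (std::size_t i = 0; i < x.size(); i += batch_t::size) {
        auto vout = xsimd::load_unaligned(&out[i]);
        auto vx = xsimd::load_unaligned(&x[i]);
        auto vy = xsimd::load_unaligned(&y[i]);
        vout += xsimd::sqrt((xsimd::cos(vx) + xsimd::sin(vy)) / 2.);
        vout.store_unaligned(&out[i]);
    }
}

int main()
{
    const std::size_t n = 100000000; // 1e8 as in the report; lower it if memory is tight
    std::vector<double> small_x(n), large_x(n), y(n), out(n);
    for (std::size_t i = 0; i < n; ++i) {
        small_x[i] = .00002 * i;      // original data: arguments stay below ~2000
        large_x[i] = double(i);       // reported data: arguments grow up to ~1e8
        y[i] = std::sin(double(i));
    }
    for (const auto* px : { &small_x, &large_x }) {
        auto t0 = std::chrono::steady_clock::now();
        kernel(*px, y, out);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << out[n / 2] << " computed in "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    }
    return 0;
}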
Here's a godbolt with the alternatives: https://godbolt.org/z/TosvEr9fz Both show the 2.5x speedup.
The code you use has n=1e6, p=1e2. With those parameters the speedup is real and I get the same result as you. But when n=1e8, p=1, the timings differ, as I described above. When I change your code to
//unsigned n = std::atoi("1000000");
//unsigned p = std::atoi("100");
unsigned n = std::atoi("100000000");
unsigned p = std::atoi("1");
unfortunately it cannot execute. With n = std::atoi("1000000") and p = std::atoi("100") the timing certainly matches yours. So if you make n large enough, you will find that xsimd costs more time; by debugging I found that the internal xsimd::sin and xsimd::cos are the time-consuming parts. If the data size is small, such as n=1e6, they are not; but with n=1e8, the version using std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2) rather than std::sqrt((xsimd::cos(x[i]) + xsimd::sin(y[i])) / 2) takes less time.
That greatly impacts xsimd::cos and xsimd::sin: if the data size is small, such as n=1e6, there is not a big difference, but with n=1e8 the results differ greatly.
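If the slowdown really does come from xsimd::cos and xsimd::sin on large arguments (my hypothesis only; vectorized trig implementations typically take a more expensive path when arguments need heavy range reduction), a quick check is to pre-reduce x into [0, 2*pi) once, outside the timed loop, and see whether the xsimd timing recovers:

#include <cmath>
#include <vector>

// Wrap every element of x into [0, 2*pi) once, before the benchmark loop.
// cos(x) is mathematically unchanged (up to the accuracy of fmod), so the
// kernel computes the same values but only ever sees small arguments.
void wrap_to_two_pi(std::vector<double>& x)
{
    constexpr double two_pi = 6.283185307179586476925287;
    for (double& v : x)
        v = std::fmod(v, two_pi);
}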