Complex calculation using AVX2 is 3 times slower than the scalar version

Open YggSky opened this issue 2 years ago • 4 comments

I used your example and only modified the operation. The std::vector<double, xsimd::aligned_allocator> has size 1e8.

xsimd::sqrt((xsimd::cos(ba) + xsimd::sin(bb)) / 2) takes 12 s, which is slower than std::sqrt((std::cos(a[i]) + std::sin(b[i])) / 2), which takes 4.4 s.

YggSky avatar Jan 07 '23 07:01 YggSky

Sorry, I cannot reproduce your timings. Here is the test program I've been using

#include <cmath>     // std::sqrt, std::cos, std::sin
#include <cstdlib>   // std::atoi
#include <iostream>
#include <xsimd/xsimd.hpp>
#include <vector>

int main(int argc, char** argv)
{
  unsigned n = std::atoi(argv[1]);
  unsigned p = std::atoi(argv[2]);
  std::vector<double> x(n);
  std::vector<double> y(n);
  std::vector<double> out(n);
  for(unsigned i = 0; i < n; ++i) {
    x[i] = .00002 * i;
    y[i] = .00003 * i;
  }
  for(unsigned j = 0; j < p; ++j) {
#ifdef USE_XSIMD
    for(unsigned i = 0; i < n; i += xsimd::batch<double, xsimd::avx2>::size) {
      auto vout = xsimd::load_unaligned(&out[i]);
      auto vx = xsimd::load_unaligned(&x[i]);
      auto vy = xsimd::load_unaligned(&y[i]);
      vout += xsimd::sqrt((xsimd::cos(vx) + xsimd::sin(vy)) / 2.);
      vout.store_unaligned(&out[i]);
    }
#else
    for(unsigned i = 0; i < n; ++i) {
      out[i] += std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2);
    }
#endif
  }
  std::cout << out[n / p] << "\n";
  return 0;
}

Compiled with

g++ test.cpp -O2 -mavx2 -o r -UUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100

or

g++ test.cpp -O2 -mavx2 -o r -DUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100

I get a consistent x2.5 speedup with xsimd on... same for -O3 and with clang.

serge-sans-paille avatar Jan 10 '23 18:01 serge-sans-paille

I used your code and got the same result, even after changing to n=1e8, p=1. But when I change the data to my input

  for(unsigned i = 0; i < n; ++i) {
    // x[i] = .00002 * i;
    // y[i] = .00003 * i;
    x[i] = i;
    y[i] = std::sin(i);
  }

the timings are very different. With n=1e8, p=1 it takes 12 s with xsimd versus 4 s for the scalar version. With n=1e6, p=1e2 it takes 1.1 s (with xsimd AVX2) versus 2.4 s (scalar). I think there may be two reasons: 1. the input data, 2. the data size. Either could affect the time cost.

YggSky avatar Jan 11 '23 06:01 YggSky
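The two hypotheses above (input data versus data size) can be tested separately by keeping n fixed and swapping only the fill pattern. Below is a minimal sketch of such a harness using only the standard library. The helper names (`fill_small`, `fill_large`, `scalar_kernel`, `time_seconds`) are made up for illustration and do not come from xsimd or from the thread.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Input pattern from the first test program: arguments stay small.
inline void fill_small(std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = .00002 * static_cast<double>(i);
        y[i] = .00003 * static_cast<double>(i);
    }
}

// Input pattern from the follow-up comment: x[i] grows up to n - 1.
inline void fill_large(std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = static_cast<double>(i);
        y[i] = std::sin(static_cast<double>(i));
    }
}

// Scalar version of the expression under test. Results go to `out`
// so the compiler cannot discard the loop.
inline void scalar_kernel(const std::vector<double>& x,
                          const std::vector<double>& y,
                          std::vector<double>& out) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2);
    }
}

// Wall-clock time of one call to `kernel`, in seconds.
template <class Kernel>
double time_seconds(Kernel&& kernel) {
    const auto t0 = std::chrono::steady_clock::now();
    kernel();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

To use it, allocate x, y and out once with the same n, call `fill_small`, time the kernel with `time_seconds([&] { scalar_kernel(x, y, out); })`, then repeat after `fill_large`. The xsimd loop from the test program can be wrapped in a lambda and timed the same way. If the ratio between the two versions flips between the two fills at the same n, the input values rather than the array size would be the likely cause.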

Here's a godbolt with the alternatives: https://godbolt.org/z/TosvEr9fz Both show the 2.5x speedup.

amyspark avatar Jan 13 '23 14:01 amyspark

The code you use has n=1e6, p=1e2. With these parameters there is a true speedup, and I get the same result as you. But with n=1e8, p=1 the timings differ, as I described above. When I change your code to

// unsigned n = std::atoi("1000000");
// unsigned p = std::atoi("100");

unsigned n = std::atoi("100000000");
unsigned p = std::atoi("1");

unfortunately it cannot be executed. With n = std::atoi("1000000") and p = std::atoi("100") the timings certainly match yours. So if you make n large enough, you will find that xsimd costs more time. While debugging I found that the internal xsimd::sin or xsimd::cos is what consumes the time. If the data size is small, such as n=1e6, these functions are not a problem, but with n=1e8, std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2) is faster than std::sqrt((xsimd::cos(x[i]) + xsimd::sin(y[i])) / 2).

xsimd::cos and xsimd::sin have a large impact: if the data size is small, such as n=1e6, there is no big difference, but with n=1e8 the results differ greatly.

YggSky avatar Jan 15 '23 01:01 YggSky
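With the modified input x[i] = i, raising n from 1e6 to 1e8 also raises the largest argument passed to cos from about 1e6 to about 1e8, so array size and argument magnitude change together. One possible way to tell them apart (a suggestion, not something tried in this thread) is to wrap the arguments into one period before the timed loop and rerun with the same n. The helper names below are hypothetical and the code uses only the standard library.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Reduce an angle to the open interval (-2*pi, 2*pi).
// std::fmod keeps the sign of x. The period used here is the rounded
// double value of 2*pi, so accuracy degrades for very large x; this is
// meant as a timing diagnostic, not as a precise range reduction.
inline double wrap_angle(double x) {
    const double two_pi = 2.0 * std::acos(-1.0);
    return std::fmod(x, two_pi);
}

// Wrap every element of v in place.
inline void wrap_in_place(std::vector<double>& v) {
    for (double& a : v) {
        a = wrap_angle(a);
    }
}

// Largest absolute value in v; handy for checking what range of
// arguments the trigonometric functions will actually see.
inline double max_abs(const std::vector<double>& v) {
    double m = 0.0;
    for (double a : v) {
        m = std::max(m, std::fabs(a));
    }
    return m;
}
```

Calling `wrap_in_place(x)` after the fill and before the timed loop keeps n at 1e8 while bounding every argument by 2*pi. If the xsimd timing then returns to the expected speedup, argument magnitude would be the more likely explanation than data size.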