xtensor
Added fft and circular convolution with no dependencies
Added xfft:
- xt::fft::fft
- xt::fft::ifft
- xt::fft::fftconvolve
Added an xfft test suite to the build and added relevant tests for the routines referenced above
Checklist
- [x] The title and commit message(s) are descriptive.
- [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
- [x] Tests have been added for new features or bug fixes.
- [x] The API of new functions and classes is documented.
Description
@JohanMabille and @tdegeus the main purpose of this PR is to add an FFT which does not depend on FFTW or other external libraries. I'm looking to implement the interface of numpy.fft.fft. We get a circular convolution for free from the implementation of the Bluestein FFT algorithm. I'm certain that I haven't gotten close to FFTW in terms of performance, but I hate having to build / link to FFTW when I'm doing stuff offline. Finally, the licensing of FFTW is not as permissive as xtensor's (GNU vs BSD). I wanted your input before I solidify the documentation.
Thanks @spectre-ns . I have not looked yet at the PR, but here is already a broad comment. I think it'd be great to be able to choose multiple algorithms depending on your needs (in function of the stage of development). Thereby I mean: it would be great to have an API that you can just use with a backend that you can choose.
@tdegeus yes. I wrote this so it won't conflict with xtensor-fftw, so you can choose the more optimized FFTW implementation if you need the extra speed, but the slower xfft implementation is there if you're looking for something simple with no external dependencies. Python does the same thing with FFTW and NumPy.
That makes sense. The only thing I was wondering though: should we think about an API in which a user can change the backend without changing much of its code (i.e. without changing the function calls, but by just changing a configuration parameter)? Not sure if this is better, just a thought.
In Python the FFTW and NumPy FFT interfaces are different, which is what I was trying to emulate. This could be a good idea; would I have to dispatch at link time? I saw you guys did that for BLAS (OpenBLAS vs MKL), but it wasn't clear to me how you did it. Alternatively, we could accomplish that using a build script and a macro, where xfft defers to FFTW under a compile-time define. I've also found that FFTW doesn't handle N-dimensional data right... Separate issue.
On second thought... Because xfft is header only it would have to be a compile flag so that the methods don't get compiled into the translation unit...
@tdegeus I could do something like
#ifdef XTENSOR_FFTW
#include <xtensor-fftw/fftw.hpp>
#else
#include <xtensor/xfft.hpp>
#endif
Doing that in a top-level header would make the selection a compiler flag, provided the interfaces were the same. The only issue is that this would create a circular dependency, xtensor -> xtensor-fftw -> xtensor... Might not be good design.
I would probably do that indeed. Maybe the others could comment @JohanMabille @wolfv @SylvainCorlay ?
@tdegeus can we work on getting this in to eliminate the need for FFTW dependencies and work on the combined interface in a separate PR?
Yes! I was just hoping that the others could share their experience from other parts of xtensor to keep a uniform way of having the drop-in backend.
(Also, I'm working on a few things at the same time, so it will be not before later this week that I can look at this)
I think that the preprocessor suggestion you made is excellent!
@spectre-ns : This will be absolutely amazing!
@NAThompson Would you be up for some benchmarking of fftw vs xfft? I expect it to be slower, but I'm curious by how much. I think it is valuable to have at any rate.
@spectre-ns : Could not complete the benchmarking:
~/t/build ❯❯❯ ./fftw_vs_xtensor
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2022-04-30T10:08:54-07:00
Running ./fftw_vs_xtensor
Run on (8 X 24.1209 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x8)
L1 Instruction 128 KiB (x8)
L2 Unified 4096 KiB (x2)
Load Average: 0.95, 1.18, 1.53
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016bb0dcc0 overflowed to 0x00016bb0dcb8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016bb0dd00 overflowed to 0x00016bb0dcf8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
/usr/local/include/xtensor/xstrides.hpp:440:44: runtime error: signed integer overflow: 6101720352 * 6101720287 cannot be represented in type 'long'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstrides.hpp:440:44 in
/usr/local/include/xtensor/xaxis_slice_iterator.hpp:155:63: runtime error: addition of unsigned offset to 0x00016bb0dcc0 overflowed to 0x00016bb0dcb8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xaxis_slice_iterator.hpp:155:63 in
=================================================================
==3440==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x00016bb0dd90 at pc 0x00010431f454 bp 0x00016bb0ac10 sp 0x00016bb0ac08
READ of size 8 at 0x00016bb0dd90 thread T0
#0 0x10431f450 in decltype((std::__1::forward<unsigned long&>(fp)) * (std::__1::forward<unsigned long const&>(fp0))) std::__1::multiplies<void>::operator()<unsigned long&, unsigned long const&>(unsigned long&, unsigned long const&) const operations.h:132
#1 0x104318ff8 in unsigned long std::__1::accumulate<unsigned long const*, unsigned long, std::__1::multiplies<void> >(unsigned long const*, unsigned long const*, unsigned long, std::__1::multiplies<void>) numeric:187
#2 0x104317bcc in xt::xaxis_slice_iterator<xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&>::xaxis_slice_iterator<xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&>(xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&, unsigned long, unsigned long, unsigned long) xaxis_slice_iterator.hpp:156
Reproducer:
#include <benchmark/benchmark.h>
#include <iostream>
#include <random>
#include <xtensor/xfft.hpp>
#include <fftw3.h>
static void xtensor_fft(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
size_t n = state.range(0);
xt::xarray<double> x(n);
for (auto& z : x) {
z = unif(rd);
}
for (auto _ : state) {
auto z = xt::fft::fft(x);
benchmark::DoNotOptimize(z);
x[0] += 8 * std::numeric_limits<double>::epsilon();
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(xtensor_fft)->Range(2, 64000)->Complexity();
BENCHMARK_MAIN();
Zipfile with CMakeLists etc attached.
@tdegeus @NAThompson Looks like you're running the program through a sanitizer? Those errors are propagating from inside xtensor. The library should be free of undefined behavior, so this looks like a bug in xtensor. Weird, because it compiles and passes on Clang, GCC and MSVC...
@spectre-ns : Yeah, I added the -fsanitize=address -fsanitize=undefined flags. I didn't go searching for these until I hit a segfault though.
OS and compiler?
$ clang --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: arm64-apple-darwin21.4.0
@NAThompson Thanks! I was able to reproduce on Linux GCC as well. Your issue is that you're trying to create a 1-D array of length n, but you're actually creating a 0-D array with the value n. The slice and fft functions are undefined for that case and therefore give you an error. Try building off this code.
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
std::size_t n = 1024;
auto x = xt::xarray<double>::from_shape({n});
auto z = xt::fft::fft(x);
I should add a check for minimum dimensionality.
@spectre-ns : Thanks, that fixed it! That check would certainly be nice for foolish noobs like myself.
Still have an UBSan error but arising from deep in the library:
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016b205cd0 overflowed to 0x00016b205cc8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
Now we have another problem that the complexity is better fit by a cubic in length than NlogN:
~/t/build ❯❯❯ ./fftw_vs_xtensor
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2022-04-30T11:13:45-07:00
Running ./fftw_vs_xtensor
Run on (8 X 24.0835 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x8)
L1 Instruction 128 KiB (x8)
L2 Unified 4096 KiB (x2)
Load Average: 1.59, 1.42, 1.33
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
xtensor_fft/2 910 ns 910 ns 723694
xtensor_fft/8 1192 ns 1192 ns 586682
xtensor_fft/64 4431 ns 4431 ns 158191
xtensor_fft/512 31718 ns 31718 ns 22044
xtensor_fft/4096 277507 ns 277507 ns 2516
xtensor_fft/32768 2545581 ns 2545607 ns 275
xtensor_fft/64000 43173807 ns 43171625 ns 16
xtensor_fft_BigO 0.00 N^3 0.00 N^3
xtensor_fft_RMS 19 % 19 %
fftw_r2c/2 6.66 ns 6.66 ns 105179331
fftw_r2c/8 9.21 ns 9.21 ns 75995267
fftw_r2c/64 70.2 ns 70.2 ns 9977337
fftw_r2c/512 977 ns 977 ns 717537
fftw_r2c/4096 9807 ns 9807 ns 71279
fftw_r2c/32768 238726 ns 238726 ns 2928
fftw_r2c/64000 300668 ns 300668 ns 2323
fftw_r2c_BigO 5.23 N 5.23 N
fftw_r2c_RMS 37 % 37 %
That's proper slow. lol
Try C2C as the xtensor version treats the real case with the general complex solution.
@spectre-ns : Is this only nlogn for power of 2 lengths?
Yeah, the radix-2 algorithm relies on factors of 2. It will be way slower in the general case, as it decomposes non-power-of-2 sizes into a circular convolution requiring two radix-2 FFT steps at the next larger power-of-2 size. FFT for non-2^N sizes is really slow.
FFT for non-2^N sizes is really slow.
Hmm. My measurements of FFTW show that power of 3 FFTs are faster than power of 2:
fftw_r2c/2 6.65 ns 6.65 ns 105007351
fftw_r2c/4 7.17 ns 7.17 ns 97609951
fftw_r2c/8 9.23 ns 9.23 ns 75937558
fftw_r2c/16 15.4 ns 15.4 ns 45567280
fftw_r2c/32 32.8 ns 32.8 ns 21302431
fftw_r2c/64 70.1 ns 70.1 ns 9988299
fftw_r2c/128 165 ns 165 ns 4234725
fftw_r2c/256 538 ns 538 ns 1340637
fftw_r2c/512 975 ns 975 ns 717573
fftw_r2c/1024 2130 ns 2130 ns 328754
fftw_r2c/2048 4499 ns 4499 ns 155592
fftw_r2c/4096 9823 ns 9823 ns 71161
fftw_r2c/8192 24030 ns 24030 ns 29139
fftw_r2c/16384 53946 ns 53946 ns 12966
fftw_r2c/32768 238757 ns 238757 ns 2931
fftw_r2c/65536 513781 ns 513782 ns 1364
fftw_r2c/131072 1050938 ns 1050962 ns 665
fftw_r2c/262144 2357824 ns 2354483 ns 298
fftw_r2c_BigO 0.49 NlgN 0.49 NlgN
fftw_r2c_RMS 9 % 9 %
fftw_r2c/3 9.09 ns 8.90 ns 95773646
fftw_r2c/9 14.6 ns 14.5 ns 48483169
fftw_r2c/27 115 ns 114 ns 6033339
fftw_r2c/81 282 ns 280 ns 2483388
fftw_r2c/243 1032 ns 1027 ns 684938
fftw_r2c/729 2509 ns 2465 ns 248990
fftw_r2c/2187 9890 ns 9884 ns 71227
fftw_r2c/6561 28191 ns 28150 ns 25171
fftw_r2c/19683 111384 ns 111384 ns 6275
fftw_r2c/59049 323147 ns 323147 ns 2167
fftw_r2c/177147 1170404 ns 1170412 ns 590
fftw_r2c/300000 1937187 ns 1937190 ns 358
fftw_r2c_BigO 0.36 NlgN 0.36 NlgN
fftw_r2c_RMS 6 % 6 %
Yeah you'll get a different optimization for r2c vs c2c. Try c2c and see what you get.
I have now improved the benchmark:
#include <benchmark/benchmark.h>
#include <iostream>
#include <random>
#include <xtensor/xfft.hpp>
#include <fftw3.h>
static void xtensor_fft(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
size_t n = state.range(0);
auto x = xt::xarray<double>::from_shape({n});
for (auto& z : x) {
z = unif(rd);
}
for (auto _ : state) {
auto z = xt::fft::fft(x);
benchmark::DoNotOptimize(z);
x[0] += 8 * std::numeric_limits<double>::epsilon();
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(xtensor_fft)->RangeMultiplier(2)->Range(2, 262144)->Complexity();
BENCHMARK(xtensor_fft)->RangeMultiplier(3)->Range(3, 300000)->Complexity();
static void fftw_r2c(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(-1.0, 1.0);
size_t n = state.range(0);
std::vector<double> x(n);
for (auto & z : x) {
z = unif(rd);
}
auto gamma = fftw_alloc_complex(n);
fftw_plan plan = fftw_plan_dft_r2c_1d(n, x.data(), gamma, FFTW_ESTIMATE);
if (!plan) {
throw std::domain_error("Could not create r2c fftw plan");
}
for (auto _ : state) {
fftw_execute(plan);
benchmark::DoNotOptimize(x[0]);
}
fftw_destroy_plan(plan);
state.SetComplexityN(state.range(0));
}
BENCHMARK(fftw_r2c)->RangeMultiplier(2)->Range(2, 262144)->Complexity();
BENCHMARK(fftw_r2c)->RangeMultiplier(3)->Range(3, 300000)->Complexity();
BENCHMARK_MAIN();
This gives the following data:
xtensor_fft/2 913 ns 913 ns 715505
xtensor_fft/4 1007 ns 1007 ns 694686
xtensor_fft/8 1190 ns 1190 ns 589628
xtensor_fft/16 1632 ns 1632 ns 434181
xtensor_fft/32 2572 ns 2572 ns 270873
xtensor_fft/64 4438 ns 4438 ns 157378
xtensor_fft/128 8143 ns 8143 ns 85501
xtensor_fft/256 15798 ns 15798 ns 44050
xtensor_fft/512 31404 ns 31404 ns 22260
xtensor_fft/1024 64495 ns 64495 ns 10844
xtensor_fft/2048 134408 ns 134408 ns 5203
xtensor_fft/4096 278958 ns 278959 ns 2507
xtensor_fft/8192 582005 ns 582005 ns 1203
xtensor_fft/16384 1202766 ns 1202758 ns 582
xtensor_fft/32768 2506735 ns 2506735 ns 279
xtensor_fft/65536 5284727 ns 5284689 ns 132
xtensor_fft/131072 11285511 ns 11285419 ns 62
xtensor_fft/262144 23902101 ns 23902103 ns 29
xtensor_fft_BigO 5.07 NlgN 5.07 NlgN
xtensor_fft_RMS 1 % 1 %
xtensor_fft/3 6845 ns 6845 ns 102105
xtensor_fft/9 12424 ns 12424 ns 56418
xtensor_fft/27 20209 ns 20209 ns 34255
xtensor_fft/81 62942 ns 62942 ns 11103
xtensor_fft/243 125608 ns 125608 ns 5568
xtensor_fft/729 501351 ns 501350 ns 1392
xtensor_fft/2187 2104784 ns 2104783 ns 332
xtensor_fft/6561 4683574 ns 4683577 ns 149
xtensor_fft/19683 19299995 ns 19300028 ns 36
xtensor_fft/59049 42841047 ns 42841062 ns 16
xtensor_fft/177147 188836823 ns 188836750 ns 4
xtensor_fft/300000 399064125 ns 399064000 ns 2
xtensor_fft_BigO 69.67 NlgN 69.67 NlgN
xtensor_fft_RMS 21 % 21 %
fftw_r2c/2 6.65 ns 6.65 ns 105007351
fftw_r2c/4 7.17 ns 7.17 ns 97609951
fftw_r2c/8 9.23 ns 9.23 ns 75937558
fftw_r2c/16 15.4 ns 15.4 ns 45567280
fftw_r2c/32 32.8 ns 32.8 ns 21302431
fftw_r2c/64 70.1 ns 70.1 ns 9988299
fftw_r2c/128 165 ns 165 ns 4234725
fftw_r2c/256 538 ns 538 ns 1340637
fftw_r2c/512 975 ns 975 ns 717573
fftw_r2c/1024 2130 ns 2130 ns 328754
fftw_r2c/2048 4499 ns 4499 ns 155592
fftw_r2c/4096 9823 ns 9823 ns 71161
fftw_r2c/8192 24030 ns 24030 ns 29139
fftw_r2c/16384 53946 ns 53946 ns 12966
fftw_r2c/32768 238757 ns 238757 ns 2931
fftw_r2c/65536 513781 ns 513782 ns 1364
fftw_r2c/131072 1050938 ns 1050962 ns 665
fftw_r2c/262144 2357824 ns 2354483 ns 298
fftw_r2c_BigO 0.49 NlgN 0.49 NlgN
fftw_r2c_RMS 9 % 9 %
fftw_r2c/3 9.09 ns 8.90 ns 95773646
fftw_r2c/9 14.6 ns 14.5 ns 48483169
fftw_r2c/27 115 ns 114 ns 6033339
fftw_r2c/81 282 ns 280 ns 2483388
fftw_r2c/243 1032 ns 1027 ns 684938
fftw_r2c/729 2509 ns 2465 ns 248990
fftw_r2c/2187 9890 ns 9884 ns 71227
fftw_r2c/6561 28191 ns 28150 ns 25171
fftw_r2c/19683 111384 ns 111384 ns 6275
fftw_r2c/59049 323147 ns 323147 ns 2167
fftw_r2c/177147 1170404 ns 1170412 ns 590
fftw_r2c/300000 1937187 ns 1937190 ns 358
fftw_r2c_BigO 0.36 NlgN 0.36 NlgN
fftw_r2c_RMS 6 % 6 %
Really, it looks like the length-2 and length-3 cases could be sped up quite a bit without too much effort, and I think you could get competitive.
Looks like I'm way slower... I don't think I'm going to be anywhere close to competitive.
@NAThompson You can probably give a nice little speed boost by enabling xsimd.