xtensor
Added fft and circular convolution with no dependencies
Added xfft:
- xt::fft::fft
- xt::fft::ifft
- xt::fft::fftconvolve
Added an xfft test suite to the build and added relevant tests for the routines referenced above
Checklist
- [x] The title and commit message(s) are descriptive.
- [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
- [x] Tests have been added for new features or bug fixes.
- [x] The API of new functions and classes is documented.
Description
@JohanMabille and @tdegeus the main purpose of this PR is to add an FFT which does not depend on FFTW or other external libraries. I'm looking to implement the interface of numpy.fft.fft. We get a circular convolution for free from the implementation of the Bluestein FFT algorithm. I'm certain that I haven't gotten close to FFTW in terms of performance, but I hate having to build / link to FFTW when I'm doing stuff offline. Finally, the licensing of FFTW is not as permissive as xtensor's (GNU vs BSD). I wanted your input before I solidify the documentation.
Thanks @spectre-ns . I have not looked yet at the PR, but here is already a broad comment. I think it'd be great to be able to choose multiple algorithms depending on your needs (in function of the stage of development). Thereby I mean: it would be great to have an API that you can just use with a backend that you can choose.
@tdegeus yes. I wrote this so it won't conflict with xtensor-fftw, so you can choose the more optimized FFTW implementation if you need the extra speed, but the slower xfft implementation is there if you're looking for something simple with no external dependencies. Python does the same thing with FFTW and NumPy.
That makes sense. The only thing I was wondering though: should we think about an API in which a user can change the backend without changing much of its code (i.e. without changing the function calls, but by just changing a configuration parameter)? Not sure if this is better, just a thought.
In Python the FFTW and NumPy FFT interfaces are different, which is what I was trying to emulate. This could be a good idea; would I have to dispatch at link time? I saw you guys did that for BLAS (OpenBLAS vs MKL), but it wasn't clear to me how you did it. Alternatively, we could accomplish that using a build script and a macro, where xfft defers to FFTW under a compile-time define. I've also found that FFTW doesn't handle N-dimensional data right... Separate issue.
On second thought... Because xfft is header only it would have to be a compile flag so that the methods don't get compiled into the translation unit...
@tdegeus I could do something like
#ifdef XTENSOR_FFTW
#include <xtensor-fftw/fftw.hpp>
#else
#include <xtensor/xfft.hpp>
#endif
Doing that in a top-level header would make the selection a compiler flag, provided the interfaces were the same. The only issue is that this would create a circular dependency, xtensor -> xtensor-fftw -> xtensor... Might not be good design.
I would probably do that indeed. Maybe the others could comment @JohanMabille @wolfv @SylvainCorlay ?
@tdegeus can we work on getting this in to eliminate the need for FFTW dependencies and work on the combined interface in a separate PR?
Yes! I was just hoping that the others could share their experience from other parts of xtensor to keep a uniform way of having the drop-in backend.
(Also, I'm working on a few things at the same time, so it will be not before later this week that I can look at this)
I think that the preprocessor suggestion you made is excellent!
@spectre-ns : This will be absolutely amazing!
@NAThompson Would you be up for some benchmarking of fftw vs xfft? I expect it to be slower, but I'm curious by how much. I think it is valuable to have at any rate.
@spectre-ns : Could not complete the benchmarking:
~/t/build ❯❯❯ ./fftw_vs_xtensor
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2022-04-30T10:08:54-07:00
Running ./fftw_vs_xtensor
Run on (8 X 24.1209 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x8)
L1 Instruction 128 KiB (x8)
L2 Unified 4096 KiB (x2)
Load Average: 0.95, 1.18, 1.53
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016bb0dcc0 overflowed to 0x00016bb0dcb8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016bb0dd00 overflowed to 0x00016bb0dcf8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
/usr/local/include/xtensor/xstrides.hpp:440:44: runtime error: signed integer overflow: 6101720352 * 6101720287 cannot be represented in type 'long'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstrides.hpp:440:44 in
/usr/local/include/xtensor/xaxis_slice_iterator.hpp:155:63: runtime error: addition of unsigned offset to 0x00016bb0dcc0 overflowed to 0x00016bb0dcb8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xaxis_slice_iterator.hpp:155:63 in
=================================================================
==3440==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x00016bb0dd90 at pc 0x00010431f454 bp 0x00016bb0ac10 sp 0x00016bb0ac08
READ of size 8 at 0x00016bb0dd90 thread T0
#0 0x10431f450 in decltype((std::__1::forward<unsigned long&>(fp)) * (std::__1::forward<unsigned long const&>(fp0))) std::__1::multiplies<void>::operator()<unsigned long&, unsigned long const&>(unsigned long&, unsigned long const&) const operations.h:132
#1 0x104318ff8 in unsigned long std::__1::accumulate<unsigned long const*, unsigned long, std::__1::multiplies<void> >(unsigned long const*, unsigned long const*, unsigned long, std::__1::multiplies<void>) numeric:187
#2 0x104317bcc in xt::xaxis_slice_iterator<xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&>::xaxis_slice_iterator<xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&>(xt::xarray_container<xt::uvector<double, std::__1::allocator<double> >, (xt::layout_type)1, xt::svector<unsigned long, 4ul, std::__1::allocator<unsigned long>, true>, xt::xtensor_expression_tag>&, unsigned long, unsigned long, unsigned long) xaxis_slice_iterator.hpp:156
Reproducer:
#include <benchmark/benchmark.h>
#include <iostream>
#include <random>
#include <xtensor/xfft.hpp>
#include <fftw3.h>
static void xtensor_fft(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
size_t n = state.range(0);
xt::xarray<double> x(n);
for (auto& z : x) {
z = unif(rd);
}
for (auto _ : state) {
auto z = xt::fft::fft(x);
benchmark::DoNotOptimize(z);
x[0] += 8 * std::numeric_limits<double>::epsilon();
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(xtensor_fft)->Range(2, 64000)->Complexity();
BENCHMARK_MAIN();
Zipfile with CMakeLists etc attached.
@tdegeus @NAThompson Looks like you're running the program through a sanitizer? Those errors are propagating from inside xtensor. The library should be free of undefined behavior, so this looks like a bug in xtensor. Weird, because it compiles and passes on Clang, GCC and MSVC...
@spectre-ns : Yeah, I added the -fsanitize=address -fsanitize=undefined flags. I didn't go searching for these until I hit a segfault though.
OS and compiler?
$ clang --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: arm64-apple-darwin21.4.0
@NAThompson Thanks! I was able to reproduce on Linux GCC as well. Your issue is that you're trying to create a 1-D array of length n, but you're actually creating a 0-D array with the value n. The slice and fft functions are undefined for that case and therefore give you an error. Try building off this code.
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
std::size_t n = 1024;
auto x = xt::xarray<double>::from_shape({n});
auto z = xt::fft::fft(x);
I should add a check for minimum dimensionality.
@spectre-ns : Thanks, that fixed it! That check would certainly be nice for foolish noobs like myself.
Still have an UBSan error but arising from deep in the library:
/usr/local/include/xtensor/xstorage.hpp:910:16: runtime error: addition of unsigned offset to 0x00016b205cd0 overflowed to 0x00016b205cc8
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/xtensor/xstorage.hpp:910:16 in
Now we have another problem that the complexity is better fit by a cubic in length than NlogN:
~/t/build ❯❯❯ ./fftw_vs_xtensor
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2022-04-30T11:13:45-07:00
Running ./fftw_vs_xtensor
Run on (8 X 24.0835 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x8)
L1 Instruction 128 KiB (x8)
L2 Unified 4096 KiB (x2)
Load Average: 1.59, 1.42, 1.33
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
xtensor_fft/2 910 ns 910 ns 723694
xtensor_fft/8 1192 ns 1192 ns 586682
xtensor_fft/64 4431 ns 4431 ns 158191
xtensor_fft/512 31718 ns 31718 ns 22044
xtensor_fft/4096 277507 ns 277507 ns 2516
xtensor_fft/32768 2545581 ns 2545607 ns 275
xtensor_fft/64000 43173807 ns 43171625 ns 16
xtensor_fft_BigO 0.00 N^3 0.00 N^3
xtensor_fft_RMS 19 % 19 %
fftw_r2c/2 6.66 ns 6.66 ns 105179331
fftw_r2c/8 9.21 ns 9.21 ns 75995267
fftw_r2c/64 70.2 ns 70.2 ns 9977337
fftw_r2c/512 977 ns 977 ns 717537
fftw_r2c/4096 9807 ns 9807 ns 71279
fftw_r2c/32768 238726 ns 238726 ns 2928
fftw_r2c/64000 300668 ns 300668 ns 2323
fftw_r2c_BigO 5.23 N 5.23 N
fftw_r2c_RMS 37 % 37 %
That's proper slow. lol
Try C2C as the xtensor version treats the real case with the general complex solution.
@spectre-ns : Is this only nlogn for power of 2 lengths?
Yeah, the radix-2 algorithm relies on factors of 2. It will be way slower in the general case, as it decomposes non-power-of-2 sizes into a circular convolution requiring two radix-2 FFT steps at the next larger power-of-2 size. FFT for non-2^N sizes is really slow.
FFT for non-2^N sizes is really slow.
Hmm. My measurements of FFTW show that power of 3 FFTs are faster than power of 2:
fftw_r2c/2 6.65 ns 6.65 ns 105007351
fftw_r2c/4 7.17 ns 7.17 ns 97609951
fftw_r2c/8 9.23 ns 9.23 ns 75937558
fftw_r2c/16 15.4 ns 15.4 ns 45567280
fftw_r2c/32 32.8 ns 32.8 ns 21302431
fftw_r2c/64 70.1 ns 70.1 ns 9988299
fftw_r2c/128 165 ns 165 ns 4234725
fftw_r2c/256 538 ns 538 ns 1340637
fftw_r2c/512 975 ns 975 ns 717573
fftw_r2c/1024 2130 ns 2130 ns 328754
fftw_r2c/2048 4499 ns 4499 ns 155592
fftw_r2c/4096 9823 ns 9823 ns 71161
fftw_r2c/8192 24030 ns 24030 ns 29139
fftw_r2c/16384 53946 ns 53946 ns 12966
fftw_r2c/32768 238757 ns 238757 ns 2931
fftw_r2c/65536 513781 ns 513782 ns 1364
fftw_r2c/131072 1050938 ns 1050962 ns 665
fftw_r2c/262144 2357824 ns 2354483 ns 298
fftw_r2c_BigO 0.49 NlgN 0.49 NlgN
fftw_r2c_RMS 9 % 9 %
fftw_r2c/3 9.09 ns 8.90 ns 95773646
fftw_r2c/9 14.6 ns 14.5 ns 48483169
fftw_r2c/27 115 ns 114 ns 6033339
fftw_r2c/81 282 ns 280 ns 2483388
fftw_r2c/243 1032 ns 1027 ns 684938
fftw_r2c/729 2509 ns 2465 ns 248990
fftw_r2c/2187 9890 ns 9884 ns 71227
fftw_r2c/6561 28191 ns 28150 ns 25171
fftw_r2c/19683 111384 ns 111384 ns 6275
fftw_r2c/59049 323147 ns 323147 ns 2167
fftw_r2c/177147 1170404 ns 1170412 ns 590
fftw_r2c/300000 1937187 ns 1937190 ns 358
fftw_r2c_BigO 0.36 NlgN 0.36 NlgN
fftw_r2c_RMS 6 % 6 %
Yeah you'll get a different optimization for r2c vs c2c. Try c2c and see what you get.
I have now improved the benchmark:
#include <benchmark/benchmark.h>
#include <iostream>
#include <random>
#include <xtensor/xfft.hpp>
#include <fftw3.h>
static void xtensor_fft(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(0.01, 1);
size_t n = state.range(0);
auto x = xt::xarray<double>::from_shape({n});
for (auto& z : x) {
z = unif(rd);
}
for (auto _ : state) {
auto z = xt::fft::fft(x);
benchmark::DoNotOptimize(z);
x[0] += 8 * std::numeric_limits<double>::epsilon();
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(xtensor_fft)->RangeMultiplier(2)->Range(2, 262144)->Complexity();
BENCHMARK(xtensor_fft)->RangeMultiplier(3)->Range(3, 300000)->Complexity();
static void fftw_r2c(benchmark::State& state) {
std::random_device rd;
std::mt19937_64 mt(rd());
std::uniform_real_distribution<double> unif(-1.0, 1.0);
size_t n = state.range(0);
std::vector<double> x(n);
for (auto & z : x) {
z = unif(rd);
}
auto gamma = fftw_alloc_complex(n);
fftw_plan plan = fftw_plan_dft_r2c_1d(n, x.data(), gamma, FFTW_ESTIMATE);
if (!plan) {
throw std::domain_error("Could not create r2c fftw plan");
}
for (auto _ : state) {
fftw_execute(plan);
benchmark::DoNotOptimize(x[0]);
}
fftw_destroy_plan(plan);
state.SetComplexityN(state.range(0));
}
BENCHMARK(fftw_r2c)->RangeMultiplier(2)->Range(2, 262144)->Complexity();
BENCHMARK(fftw_r2c)->RangeMultiplier(3)->Range(3, 300000)->Complexity();
BENCHMARK_MAIN();
This gives the following data:
xtensor_fft/2 913 ns 913 ns 715505
xtensor_fft/4 1007 ns 1007 ns 694686
xtensor_fft/8 1190 ns 1190 ns 589628
xtensor_fft/16 1632 ns 1632 ns 434181
xtensor_fft/32 2572 ns 2572 ns 270873
xtensor_fft/64 4438 ns 4438 ns 157378
xtensor_fft/128 8143 ns 8143 ns 85501
xtensor_fft/256 15798 ns 15798 ns 44050
xtensor_fft/512 31404 ns 31404 ns 22260
xtensor_fft/1024 64495 ns 64495 ns 10844
xtensor_fft/2048 134408 ns 134408 ns 5203
xtensor_fft/4096 278958 ns 278959 ns 2507
xtensor_fft/8192 582005 ns 582005 ns 1203
xtensor_fft/16384 1202766 ns 1202758 ns 582
xtensor_fft/32768 2506735 ns 2506735 ns 279
xtensor_fft/65536 5284727 ns 5284689 ns 132
xtensor_fft/131072 11285511 ns 11285419 ns 62
xtensor_fft/262144 23902101 ns 23902103 ns 29
xtensor_fft_BigO 5.07 NlgN 5.07 NlgN
xtensor_fft_RMS 1 % 1 %
xtensor_fft/3 6845 ns 6845 ns 102105
xtensor_fft/9 12424 ns 12424 ns 56418
xtensor_fft/27 20209 ns 20209 ns 34255
xtensor_fft/81 62942 ns 62942 ns 11103
xtensor_fft/243 125608 ns 125608 ns 5568
xtensor_fft/729 501351 ns 501350 ns 1392
xtensor_fft/2187 2104784 ns 2104783 ns 332
xtensor_fft/6561 4683574 ns 4683577 ns 149
xtensor_fft/19683 19299995 ns 19300028 ns 36
xtensor_fft/59049 42841047 ns 42841062 ns 16
xtensor_fft/177147 188836823 ns 188836750 ns 4
xtensor_fft/300000 399064125 ns 399064000 ns 2
xtensor_fft_BigO 69.67 NlgN 69.67 NlgN
xtensor_fft_RMS 21 % 21 %
fftw_r2c/2 6.65 ns 6.65 ns 105007351
fftw_r2c/4 7.17 ns 7.17 ns 97609951
fftw_r2c/8 9.23 ns 9.23 ns 75937558
fftw_r2c/16 15.4 ns 15.4 ns 45567280
fftw_r2c/32 32.8 ns 32.8 ns 21302431
fftw_r2c/64 70.1 ns 70.1 ns 9988299
fftw_r2c/128 165 ns 165 ns 4234725
fftw_r2c/256 538 ns 538 ns 1340637
fftw_r2c/512 975 ns 975 ns 717573
fftw_r2c/1024 2130 ns 2130 ns 328754
fftw_r2c/2048 4499 ns 4499 ns 155592
fftw_r2c/4096 9823 ns 9823 ns 71161
fftw_r2c/8192 24030 ns 24030 ns 29139
fftw_r2c/16384 53946 ns 53946 ns 12966
fftw_r2c/32768 238757 ns 238757 ns 2931
fftw_r2c/65536 513781 ns 513782 ns 1364
fftw_r2c/131072 1050938 ns 1050962 ns 665
fftw_r2c/262144 2357824 ns 2354483 ns 298
fftw_r2c_BigO 0.49 NlgN 0.49 NlgN
fftw_r2c_RMS 9 % 9 %
fftw_r2c/3 9.09 ns 8.90 ns 95773646
fftw_r2c/9 14.6 ns 14.5 ns 48483169
fftw_r2c/27 115 ns 114 ns 6033339
fftw_r2c/81 282 ns 280 ns 2483388
fftw_r2c/243 1032 ns 1027 ns 684938
fftw_r2c/729 2509 ns 2465 ns 248990
fftw_r2c/2187 9890 ns 9884 ns 71227
fftw_r2c/6561 28191 ns 28150 ns 25171
fftw_r2c/19683 111384 ns 111384 ns 6275
fftw_r2c/59049 323147 ns 323147 ns 2167
fftw_r2c/177147 1170404 ns 1170412 ns 590
fftw_r2c/300000 1937187 ns 1937190 ns 358
fftw_r2c_BigO 0.36 NlgN 0.36 NlgN
fftw_r2c_RMS 6 % 6 %
Really, it looks like the length-2 and length-3 cases could be sped up quite a bit without too much effort, and I think you could get competitive.
Looks like I'm way slower... I don't think I'm going to be anywhere close to competitive.
@NAThompson You can probably give a nice little speed boost by enabling xsimd.