FFTW.jl
FFTW.jl copied to clipboard
Allocations when multi-threading
When activate multithreading by FFTW.set_num_threads(), it become slower and allocate more.
The provider here is the default "fftw".
But when using "mkl", it just use multithreading automatically depending on the problem size and allocation is 0.
Here are some benchmarks
data1=rand(ComplexF64,2^12)
p1=plan_fft!(data1)
julia> p1=plan_fft!(data1)
FFTW in-place forward plan for 4096-element array of ComplexF64
(dft-ct-dit/32
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dft-directbuf/130-128-x32 "n1fv_128_avx2"))
julia> @benchmark $p1*$data1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 21.800 μs … 68.700 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 23.100 μs ┊ GC (median): 0.00%
Time (mean ± σ): 23.189 μs ± 1.335 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁█
▂▂▂▄▄▅▅█▆███▆▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
21.8 μs Histogram: frequency by time 29.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> FFTW.set_num_threads(6)
julia> p1=plan_fft!(data1)
FFTW in-place forward plan for 4096-element array of ComplexF64
(dft-thr-ct-dit-x6/32
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dftw-direct-32/8 "t3fv_32_avx2_128")
(dft-directbuf/130-128-x32 "n1fv_128_avx2"))
julia> @benchmark $p1*$data1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 23.900 μs … 300.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 48.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 50.904 μs ± 12.291 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▄▆█▆▄▂ ▁ ▂▁▁
▂▁▂▂▂▂▂▂▂▃▃▃▃▅▇▇███████████████▇▆▅▅▅▅▅▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂ ▄
23.9 μs Histogram: frequency by time 83.5 μs <
Memory estimate: 4.41 KiB, allocs estimate: 53.
And
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 5 2600X Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver1)
Environment:
JULIA_NUM_THREADS = 6
JULIA_PKG_SERVER = https://mirrors.tuna.tsinghua.edu.cn/julia
JULIA_EDITOR = code
- The allocation comes from julia's threading system. (FFTW backend use Julia's threading system, while MKL not.)
- IIRC, for single 1d-inplace fft, FFTW's doc has said multi-threading might not help.
Thanks for the information. Is there a way to change the threading system (not the provider) to let it allocation free?
Julia has https://github.com/JuliaSIMD/Polyester.jl, which should be able to replace https://github.com/JuliaMath/FFTW.jl/blob/17bc81a0fcf9875d777ea4bee2fca70fc23c8a0c/src/providers.jl#L56-L60 with
function spawnloop(f::Ptr{Cvoid}, fdata::Ptr{Cvoid}, elsize::Csize_t, num::Cint, callback_data::Ptr{Cvoid})
@batch per=core for i = 0:num-1
ccall(f, Ptr{Cvoid}, (Ptr{Cvoid},), fdata + elsize*i)
end
end
(I didn't test it, so don't except this to work!!!!!)
But I'm affraid this won't help the performance. FFT it self is much more heavy than the allocation.
Thank you very much! I'll test it later.
I can confirm this issue, I haven't noticed it in the past.
julia> using FFTW, BenchmarkTools
julia> x = randn((1024, 1024));
julia> FFTW.set_num_threads(8)
julia> @btime ifft($x);
7.646 ms (17810 allocations: 33.20 MiB)
julia> FFTW.set_num_threads(4)
julia> @btime ifft($x);
6.362 ms (115 allocations: 32.01 MiB)
julia> FFTW.set_num_threads(6)
julia> @btime ifft($x);
10.943 ms (38485 allocations: 34.27 MiB)