sph methods slow?
Maybe this is totally expected, but:
- A lot slower than FFT. It has been a while, but I seem to recall (going back to 2018) that the first version was much more comparable to FFT
- Allocations that don't go away with in-place (hard to make an efficient time evolution algorithm in this case)
- Which method is the bottleneck depends on size
using FastTransforms
using FFTW
using LinearAlgebra
n,m = 1024,2047
x = randn(ComplexF64,n,m)
# FFTW
@time fft(x);
0.033259 seconds (259 allocations: 32.001 MiB)
P! = plan_fft!(x);
@time P!*x;
0.035154 seconds (232 allocations: 14.969 KiB)
# Spherical
P = plan_sph2fourier(x)
PS = plan_sph_synthesis(x)
@time PS*(P*x);
0.235185 seconds (549 allocations: 127.972 MiB, 1.57% gc time)
@time P*x; # slow
0.145820 seconds (8 allocations: 63.969 MiB, 1.85% gc time)
@time PS*x;
0.051036 seconds (545 allocations: 64.003 MiB)
@time lmul!(P,x); # slow
0.139856 seconds (4 allocations: 31.984 MiB, 0.91% gc time)
@time lmul!(PS,x);
0.047418 seconds (539 allocations: 32.018 MiB)
## small array
n,m = 60,121
x = randn(ComplexF64,n,m);
# FFTW
@time fft(x);
0.000560 seconds (242 allocations: 129.836 KiB)
P! = plan_fft!(x);
@time P!*x;
0.000411 seconds (218 allocations: 14.492 KiB)
# Spherical
P = plan_sph2fourier(x);
PS = plan_sph_synthesis(x);
@time PS*(P*x);
0.101827 seconds (19.31 k allocations: 1.626 MiB)
@time P*x;
0.000266 seconds (8 allocations: 227.406 KiB)
@time PS*x; # slow
0.031964 seconds (20.48 k allocations: 1.440 MiB)
@time lmul!(P,x);
0.000364 seconds (4 allocations: 113.594 KiB)
@time lmul!(PS,x); # slow
0.037324 seconds (20.38 k allocations: 1.326 MiB)
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.4.0)
CPU: 10 × Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_PKG_DEVDIR = /Users/abradley/Dropbox/Julia/Dev
JULIA_NUM_THREADS = 8
JULIA_EDITOR = code
JULIA_PKG_SERVER = us-west.pkg.julialang.org
I'd have to check later, but my first guesses are that the allocations come from using a complex array. (You can also use complex spherical harmonics as spinsph with spin 0 which are designed for complex arrays from the beginning.)
The architecture of M1 macs is different from the Intel macs so they use a different binary with different levels of optimization. You can double check it's using all cores, and can also try compiling from source with different optimization flags, but that's about it for now I think.
Thanks for suggesting spinsph which fixes allocations nicely.
This is what I get on an intel mac. For large arrays its similar, roughly 4-5 times longer than FFTW
However, the small array handling is a lot better, closer to factor of 2
using FastTransforms
using FFTW
using LinearAlgebra
n,m = 1024,2047
x = randn(ComplexF64,n,m)
## FFTW
@time fft(x);
P! = plan_fft!(x);
@time P!*x; # 0.031843 seconds (231 allocations: 14.219 KiB)
## Spherical
P = plan_sph2fourier(x)
PS = plan_sph_synthesis(x)
@time PS*(P*x); # 0.165899 seconds (557 allocations: 127.970 MiB, 6.21% gc time)
@time P*x; # slow 0.114058 seconds (8 allocations: 63.969 MiB, 5.08% gc time)
@time PS*x; # 0.056220 seconds (545 allocations: 64.001 MiB, 11.33% gc time)
@time lmul!(P,x); # slow 0.094217 seconds (4 allocations: 31.984 MiB)
@time lmul!(PS,x); # 0.045611 seconds (535 allocations: 32.016 MiB)
## SpinSph
P = plan_spinsph2fourier(x,0)
PS = plan_spinsph_synthesis(x,0)
@time PS*(P*x); # 0.175541 seconds (545 allocations: 64.001 MiB, 3.77% gc time)
@time P*x; # slow 0.129498 seconds (2 allocations: 31.984 MiB)
@time PS*x; # 0.039490 seconds (529 allocations: 32.016 MiB)
@time lmul!(P,x); # slow 0.127229 seconds
@time lmul!(PS,x); # 0.033086 seconds (538 allocations: 32.344 KiB)
n,m = 60,121
x = randn(ComplexF64,n,m);
## FFTW
@time fft(x);
P! = plan_fft!(x);
@time P!*x; # 0.000676 seconds (219 allocations: 13.875 KiB)
## Spherical
P = plan_sph2fourier(x);
PS = plan_sph_synthesis(x);
@time PS*(P*x); # 0.013577 seconds (20.43 k allocations: 1.563 MiB)
@time P*x; # 0.000360 seconds (8 allocations: 227.406 KiB)
@time PS*x; # slow 0.013644 seconds (20.39 k allocations: 1.340 MiB)
@time lmul!(P,x); # 0.000267 seconds (4 allocations: 113.594 KiB)
@time lmul!(PS,x); # 0.011682 seconds (20.47 k allocations: 1.232 MiB)
## SpinSph
P = plan_spinsph2fourier(x,0)
PS = plan_spinsph_synthesis(x,0)
@time PS*(P*x); # 0.001753 seconds (439 allocations: 252.734 KiB)
@time P*x; # 0.000314 seconds (2 allocations: 113.484 KiB)
@time PS*x; # 0.001559 seconds (423 allocations: 138.812 KiB)
@time lmul!(P,x); # 0.000212 seconds
@time lmul!(PS,x); # 0.001416 seconds (428 allocations: 25.547 KiB)
julia> versioninfo() Julia Version 1.9.0-beta3 Commit 24204a73447 (2023-01-18 07:20 UTC) Platform Info: OS: macOS (x86_64-apple-darwin21.4.0) CPU: 16 × Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-14.0.6 (ORCJIT, skylake) Threads: 8 on 8 virtual cores Environment: JULIA_NUM_THREADS = 8 JULIA_PKG_DEVDIR = /Users/abradley/Dropbox/Julia/Dev JULIA_EDITOR = code
So there is definitely something funky going on in the real spherical harmonics, way too many allocations
using FastTransforms
using LinearAlgebra
# using Threads
N = 32
M = 2*N - 1
F = randn(N,M)
P = plan_sph2fourier(F)
PA = plan_sph_analysis(F)
PS = plan_sph_synthesis(F)
x = randn(ComplexF64,N,M)
cP = plan_spinsph2fourier(x,0)
cPA = plan_spinsph_analysis(x,0)
cPS = plan_spinsph_synthesis(x,0)
@time lmul!(PA, F); #0.006832 seconds (20.88 k allocations: 1.240 MiB)
@time ldiv!(P, F); #0.000374 seconds
@time lmul!(P, F); #0.000503 seconds
@time lmul!(PS, F); #0.007914 seconds (20.88 k allocations: 1.240 MiB)
@time lmul!(cPA, x); #0.000613 seconds (324 allocations: 21.094 KiB)
@time ldiv!(cP, x); # 0.000877 seconds
@time lmul!(cP, x); #0.000830 seconds
@time lmul!(cPS, x); #0.000585 seconds (324 allocations: 21.094 KiB)