CUDA.jl
CUFFT: support for arbitrary dims
A 1d fft across the nth dimension of a CuArray with more than n dimensions (where n is not 1) is not enabled by the wrapper (ERROR: ArgumentError: batching dims must be sequential, from CuArrays/wXQp8/src/fft/wrappers.jl).
To reproduce:
using CuArrays, CUDAnative, CUDAdrv, FFTW
dim = 2
data = CuArrays.rand(ComplexF32, 512, 512, 512);
myFft = plan_fft!(data, dim);
The wrapper for regular CPU arrays allows this, and if dim is 1 or 3 it also works as expected.
I'm running into this issue as well.
using CuArrays, FFTW
a = cu(rand(ComplexF32, 16, 8, 4, 32))
p = plan_fft!(a, 2)
The error seems to apply to plan_fft as well.
More generally, it appears that the dims argument has to (1) include the first or last dimension and (2) be sequential. For example, with the previous 4D array, the following fail:
p = plan_fft!(a, 2)
p = plan_fft!(a, 3)
p = plan_fft!(a, [2,3])
p = plan_fft!(a, [1,3])
and the following are successful:
p = plan_fft!(a, 1)
p = plan_fft!(a, [1,2])
p = plan_fft!(a, 4)
p = plan_fft!(a, [2,3,4])
This is a problem because permutedims!(src, dims) doesn't exist, so changing the order of dimensions requires an expensive copy operation.
Is this a limitation of the CUFFT library?
From looking at the CUFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) it wasn't super clear to me whether e.g. an FFT along dimension 2 for a 3D CuArray is possible, although it would probably involve a call to cufftPlanMany(), whose documentation says: "With this function, batched plans of 1, 2, or 3 dimensions may be created."
~~but maybe it's possible with clever use of the advanced data layout parameters?~~ @kburns pointed out that this is the equivalent of the FFTW "advanced" interface, not the full "guru" interface which supports arbitrary loops of strided FFTs. So indeed it seems to be a limitation of the CUFFT library (and of the MKL FFT library).
If FFTs along non-batched dimensions are not supported by CUFFT then shouldn't this issue be closed as "can't fix"?
I'm happy to help work on this issue as it would help resolve a number of important Oceananigans.jl issues (https://github.com/CliMA/Oceananigans.jl/issues/586, https://github.com/CliMA/Oceananigans.jl/issues/1007) and speed up channel models (an important class of simulations) by a factor of ~2x due to the 2D DCT I had to implement in https://github.com/CliMA/Oceananigans.jl/pull/290.
If CUFFT cannot be used along non-batched dimensions, it seems like it should be possible to implement this with permutedims!(dest::CuArray, src::CuArray, dims)? Although I'm not sure whether this functionality should enter CUDA.jl.
I think @btmit was thinking of a permutedims!(src::CuArray, dims) function, but I don't think such a function is possible if you want efficiency for all possible permutations (due to overlapping memory between input and output, similar to Base.permutedims!)?
For reference, permutedims! on a 256^3 CuArray{Complex{Float64}} takes ~1.8 ms on a V100. So it's a relatively fast operation, but it could introduce a performance hit, especially if you use a lot of FFTs, and could require allocating at least one extra array.
julia> @show typeof(src), typeof(dest);
(typeof(src), typeof(dest)) = (CuArray{Complex{Float64},3}, CuArray{Complex{Float64},3})
julia> @show size(src), size(dest);
(size(src), size(dest)) = ((256, 256, 256), (256, 256, 256))
julia> @benchmark CUDA.@sync permutedims!(dest, src, (2, 1, 3))
BenchmarkTools.Trial:
memory estimate: 800 bytes
allocs estimate: 34
--------------
minimum time: 1.828 ms (0.00% GC)
median time: 1.971 ms (0.00% GC)
mean time: 2.103 ms (0.00% GC)
maximum time: 37.842 ms (0.00% GC)
--------------
samples: 2371
evals/sample: 1
Since this is a limitation of the CUFFT library, it seems that the way to compute an FFT along a non-batched dimension is to permute the array first and then perform the FFT.
E.g. to compute an FFT along dimension 2 of a 3-dimensional array A, you would permutedims!(B, A, (2, 1, 3)), then compute an FFT along dimension 1 of B, then permutedims!(A, B, (2, 1, 3)).
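As a concrete illustration, here is a minimal sketch of that recipe (assuming a preallocated scratch array B, and `using CUDA, FFTW` to bring the AbstractFFTs interface into scope):

using CUDA, FFTW

A = CuArray(rand(ComplexF32, 256, 256, 256))
B = similar(A)                   # scratch array for the permuted data

permutedims!(B, A, (2, 1, 3))    # bring dimension 2 to the front
fft!(B, 1)                       # supported: FFT along the (now leading) dimension
permutedims!(A, B, (2, 1, 3))    # permute back; A now holds the FFT along dim 2

Note that this costs two permutation kernels plus one extra array of the same size, which is the trade-off discussed above.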
Should this issue be closed since it can't really be fixed? It's not really a bug either.
> Should this issue be closed since it can't really be fixed? It's not really a bug either.
I guess; I'm not familiar with CUFFT or the fft API.
Seeing permutedims mentioned here, I had a quick look at optimizing its implementation. It's far from optimal, but some quick improvement is up at https://github.com/JuliaGPU/GPUArrays.jl/pull/338.
I believe you can do this with the various options in cufftPlanMany() in CUFFT.
The problem I found with permutedims is that it can't be done in-place, so this approach isn't an option for large arrays.
> I believe you can do this with the various options in cufftPlanMany() in CUFFT.
Ah, I'm definitely not an expert in FFTs, but can you see how to do an FFT along dimension 2 for an array with 3 dimensions with cufftPlanMany()?
I will try and figure it out as soon as I have a minute. I think I might be able to track down some old code.
As another data point, MATLAB CUDA can do this. Though there is a chance that they are doing a permutedims copy under the hood as well.
I apologize for disappearing on this. It's a busy time of the year.
From https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout:
istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
Let's take the example of an fft across the 2nd dimension only of an array of dimension 3. In this case istride is size[1], and idist is size[1]*size[2].
Generalizing this, it seems like we could support strided ffts so long as the dimensions are sequential (no skips).
My understanding (if this behaves similarly to FFTW) is that this would only do FFTs along the 2nd dimension in the plane corresponding to index 1 in the 1st dimension (the istride here is skipping over the other elements along the 1st dimension, and idist is essentially looping over indices in the 3rd dimension). To apply an FFT along the 2nd dimension of the full array, you would have to do an outer loop applying this plan along each index in the 1st dimension (and if the plan can be reused, like it can on the CPU, this may be faster than doing a permutation).
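To make the indexing concrete, here is a small sketch of the advanced-layout parameters for a transform along dimension 2 of an N1×N2×N3 column-major array (the names N1, N2, N3 are illustrative), including the caveat above:

N1, N2, N3 = 256, 256, 256   # illustrative sizes

n       = [N2]       # length of the 1D transform (along dimension 2)
istride = N1         # stride between successive elements of one signal
idist   = N1 * N2    # distance between the first elements of consecutive signals
batch   = N3         # one signal per index of dimension 3

# Caveat: the batched signals then start at offsets 0, N1*N2, 2*N1*N2, ...,
# so only the elements with index 1 along dimension 1 are transformed;
# covering the whole array still needs an outer loop over dimension 1
# (or an FFTW-guru-style multidimensional batch, which cuFFT does not expose).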
On the FFTW side, this is why the guru interface exists -- to do multidimensional loops of transforms. But from https://docs.nvidia.com/cuda/cufft/index.html#fftw-supported-interface, it seems that this is not supported in cuFFT (the entry for "Guru Complex DFTs" lists multidimensional transform batches as "Unsupported").
Here is some code, which can be used to transform along a single (internal) dimension:
function fft!(arr::CuArray, d::Int)
    if d > 1 && d < ndims(arr)
        front_ids = ntuple(dd -> Colon(), d)                          # Colons for the leading d dimensions
        ids = ntuple(dd -> ifelse(dd <= d, Colon(), 1), ndims(arr))   # first slice along the trailing dimensions
        # plan an in-place FFT along dimension d of one contiguous leading slice
        p = plan_fft!((@view arr[ids...]), d)
        # apply the same plan to every slice over the trailing dimensions
        for c in CartesianIndices(size(arr)[d+1:end])
            p * @view arr[front_ids..., Tuple(c)...]
        end
    else
        CUDA.CUFFT.fft!(arr, d)
    end
end
It is not ideal, but at least it does not do any extra GPU allocations. On my computer it is about 2x slower than an FFT along the outer dimensions. Ideally the loop over the Cartesian indices should be replaced by a call to cufftPlanMany() (or the PencilFFTs mentioned above?). This code could also be extended to work with multiple inner dimensions given as a tuple, by simply transforming these dimensions sequentially.
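A minimal usage sketch of the helper above (assuming CUDA and FFTW are loaded so that plan_fft! is in scope; note that a top-level definition of fft! shadows the fft! exported by FFTW/AbstractFFTs rather than extending it, so a package would more likely add a method to AbstractFFTs.fft! instead):

using CUDA, FFTW

a = CuArray(rand(ComplexF32, 64, 64, 64))
fft!(a, 2)   # in-place FFT along the 2nd (non-batched) dimension, via the loop above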
... it turns out it is even slightly worse: FFTing data with more than 3 dims crashes:
julia> fca = fft(CuArray(rand(4,4,4,4)));
ERROR: CUFFTError: user specified an invalid pointer or parameter (code 4, CUFFT_INVALID_VALUE)
Stacktrace:
[1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:11
[2] macro expansion
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:24 [inlined]
[3] cufftMakePlanMany(plan::Int32, rank::Int64, n::Vector{Int32}, inembed::Ptr{Nothing}, istride::Int64, idist::Int64, onembed::Ptr{Nothing}, ostride::Int64, odist::Int64, type::CUDA.CUFFT.cufftType_t, batch::Int64, workSize::Base.RefValue{UInt64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\call.jl:26
[4] cufftMakePlan(xtype::CUDA.CUFFT.cufftType_t, xdims::NTuple{4, Int64}, region::UnitRange{Int64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:37
[5] #133
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:145 [inlined]
[6] (::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})()
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:28
[7] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, l::ReentrantLock)
@ Base .\lock.jl:185
[8] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})(f::CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:26
[9] pop!(f::Function, cache::CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, key::Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}})
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:47
[10] cufftGetPlan(::CUDA.CUFFT.cufftType_t, ::Vararg{Any})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:143
[11] plan_fft(X::CuArray{ComplexF64, 4, CUDA.Mem.DeviceBuffer}, region::UnitRange{Int64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:163
[12] fft
@ C:\Users\pi96doc\.julia\packages\AbstractFFTs\0uOAT\src\definitions.jl:63 [inlined]
[13] fft (repeats 2 times)
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:124 [inlined]
[14] top-level scope
@ REPL[12]:1
[15] top-level scope
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\src\initialization.jl:171
I might give fixing these things a shot...
this should be mostly solved now by this pull request: https://github.com/JuliaGPU/CUDA.jl/pull/1903
@ali-ramadhan @whymusticode what is the consensus here? Is permutedims + fft the way to go?
BTW, I would be grateful if somebody could provide a CUDA-compatible DCT implementation. :D
> this should be mostly solved now by this pull request: #1903
I still receive the same error when attempting a 4D fft (even on the latest CUDA.jl release). Did your comment mean that the 4D fft should be working or just that some machinery is now in place to make this possible? I would be happy to help contribute to getting this functioning if someone can provide some initial guidance on what is needed. Thanks!
Thanks for bringing this up. I will try to look into this.
Yes, currently only a total of up to 3 transform directions is supported. E.g.
fca = fft(CuArray(rand(4,4,4,4,4)), (1,3,4));
transforming directions 1, 3, and 4 of the 5D array does work. Do you have a use case where a 4D FFT is needed? I guess one could try to support it by sequentially calling transforms for the additional trailing directions (e.g. in-place on the result), but it would presumably mean more programming work. @maleadt, any opinion on this? Should the work be invested?
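For illustration of that idea (just a sketch, not something the wrapper currently does for you): since the DFT is separable, a full 4D transform could in principle be assembled from two supported calls, e.g.

using CUDA, FFTW

A = CuArray(rand(ComplexF32, 4, 4, 4, 4))
# transform the three leading (sequential) dimensions in one batched call,
# then the trailing dimension on its own; both patterns are supported
B = fft(fft(A, (1, 2, 3)), 4)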
Apparently there is also a new FFT library, https://github.com/PaulVirally/VkFFTCUDA.jl, which might be worth supporting?
In support of investing the work (which I'm happy to help with if pointed in the right direction), I am using 4D FFTs in conjunction with the Fourier Neural Operator (https://github.com/SciML/NeuralOperators.jl) to model 4D PDE solutions (3 spatial dimensions, and time). This is a feature that is supported in PyTorch (https://pytorch.org/docs/stable/generated/torch.fft.fftn.html#torch.fft.fftn), so I think supporting an n-dimensional FFT would help improve feature parity between the Julia and Python ML toolkits.
> @maleadt, any opinion on this?
Not really, sorry. I'm myself not a user of the CUDA FFT functionality.