
CUFFT: support for arbitrary dims

Open whymusticode opened this issue 5 years ago • 18 comments

A 1D FFT across the nth dimension of a more-than-n-dimensional CuArray (where n is not 1) is not enabled by the wrapper (ERROR: ArgumentError: batching dims must be sequential, from CuArrays/wXQp8/src/fft/wrappers.jl).

To reproduce:

using CuArrays, CUDAnative, CUDAdrv, FFTW
dim = 2
data = CuArrays.rand(ComplexF32, 512, 512, 512);
myFft = plan_fft!(data, dim);

The wrapper for regular CPU arrays allows this, and if dim is 1 or 3 it also works as expected.

whymusticode avatar Sep 20 '19 02:09 whymusticode

I'm running into this issue as well.

using CuArrays, FFTW
a = cu(rand(ComplexF32, 16, 8, 4, 32))
p = plan_fft!(a, 2)

The error seems to apply to plan_fft as well.

More generally it appears that the dims argument has to (1) include the first or last dimension and (2) be sequential. For example with the previous 4D array, the following fail:

p = plan_fft!(a, 2)
p = plan_fft!(a, 3)
p = plan_fft!(a, [2,3])
p = plan_fft!(a, [1,3])

and the following are successful

p = plan_fft!(a, 1)
p = plan_fft!(a, [1,2])
p = plan_fft!(a, 4)
p = plan_fft!(a, [2,3,4])

This is a problem because an in-place permutedims!(src, dims) doesn't exist, so changing the order of dimensions requires an expensive out-of-place copy.

btmit avatar May 19 '20 01:05 btmit

Is this a limitation of the CUFFT library?

From looking at the CUFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) it wasn't super clear to me whether e.g. an FFT along dimension 2 of a 3D CuArray is possible, although it would probably involve a call to cufftPlanMany(), whose documentation says that

With this function, batched plans of 1, 2, or 3 dimensions may be created.

~~but maybe it's possible with clever use of the advanced data layout parameters?~~ @kburns pointed out that this is the equivalent of the FFTW "advanced" interface, not the full "guru" interface which supports arbitrary loops of strided FFTs. So indeed it seems to be a limitation of the CUFFT library (and of the MKL FFT library).

If FFTs along non-batched dimensions are not supported by CUFFT then shouldn't this issue be closed as "can't fix"?

I'm happy to help work on this issue as it would help resolve a number of important Oceananigans.jl issues (https://github.com/CliMA/Oceananigans.jl/issues/586, https://github.com/CliMA/Oceananigans.jl/issues/1007) and speed up channel models (an important class of simulations) by a factor of ~2x due to the 2D DCT I had to implement in https://github.com/CliMA/Oceananigans.jl/pull/290.

If CUFFT cannot be used along non-batched dimensions, it seems like it should be possible to implement with permutedims!(dest::CuArray, src::CuArray, dims)? Although I'm not sure whether this functionality should enter CUDA.jl.

I think @btmit was thinking of an in-place permutedims!(src::CuArray, dims) function, but I don't think such a function can be made efficient for all possible permutations (due to overlapping memory between input and output, similar to Base.permutedims!)?


For reference, permutedims! on a 256^3 CuArray{Complex{Float64}} takes ~1.8 ms on a V100. So it's a relatively fast operation, but it could introduce a performance hit, especially if you use a lot of FFTs, and it requires allocating at least one extra array.

julia> @show typeof(src), typeof(dest);
(typeof(src), typeof(dest)) = (CuArray{Complex{Float64},3}, CuArray{Complex{Float64},3})

julia> @show size(src), size(dest);
(size(src), size(dest)) = ((256, 256, 256), (256, 256, 256))

julia> @benchmark CUDA.@sync permutedims!(dest, src, (2, 1, 3))
BenchmarkTools.Trial: 
  memory estimate:  800 bytes
  allocs estimate:  34
  --------------
  minimum time:     1.828 ms (0.00% GC)
  median time:      1.971 ms (0.00% GC)
  mean time:        2.103 ms (0.00% GC)
  maximum time:     37.842 ms (0.00% GC)
  --------------
  samples:          2371
  evals/sample:     1

ali-ramadhan avatar Oct 25 '20 14:10 ali-ramadhan

Since this is a limitation of the CUFFT library, it seems that permuting the array so that the transform falls along a batched dimension is the way to go.

E.g. to compute an FFT along dimension 2 of an array A with 3 dimensions, you would permutedims!(B, A, (2, 1, 3)) then compute an FFT along dimension 1 of B, then permutedims!(A, B, (2, 1, 3)).
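
In code, that would look something like this (a minimal sketch; fft_dim2! is a hypothetical helper, and it assumes a preallocated scratch array B with the permuted size and that fft! dispatches to CUFFT):

using CUDA, FFTW

# Sketch: FFT along dimension 2 of a 3D CuArray A, via a scratch array B
# whose size matches the permuted layout of A.
function fft_dim2!(A::CuArray, B::CuArray)
    permutedims!(B, A, (2, 1, 3))  # bring dimension 2 to the front
    fft!(B, 1)                     # batched FFT along the (now leading) dimension
    permutedims!(A, B, (2, 1, 3))  # (2, 1, 3) is its own inverse permutation
    return A
end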

Should this issue be closed since it can't really be fixed? It's not really a bug either.

ali-ramadhan avatar Nov 29 '20 16:11 ali-ramadhan

Should this issue be closed since it can't really be fixed? It's not really a bug either.

I guess; I'm not familiar with CUFFT or the fft API.

Seeing permutedims mentioned here, I had a quick look at optimizing its implementation. It's far from optimal, but some quick improvement is up at https://github.com/JuliaGPU/GPUArrays.jl/pull/338.

maleadt avatar Nov 30 '20 08:11 maleadt

I believe you can do this with the various options in cufftPlanMany() in CUFFT.

The problem I found with permutedims is that it can't be done in-place, so this approach isn't an option for large arrays.

btmit avatar Nov 30 '20 15:11 btmit

I believe you can do this with the various options in cufftPlanMany() in CUFFT.

Ah, I'm definitely not an expert in FFTs, but can you see how to do an FFT along dimension 2 of a 3-dimensional array with cufftPlanMany()?

ali-ramadhan avatar Nov 30 '20 15:11 ali-ramadhan

I will try and figure it out as soon as I have a minute. I think I might be able to track down some old code.

As another data point, MATLAB CUDA can do this. Though there is a chance that they are doing a permutedims copy under the hood as well.

btmit avatar Nov 30 '20 15:11 btmit

I apologize for disappearing on this. It's a busy time of the year.

From: https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout

istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension

idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data

Let's take the example of an FFT across the 2nd dimension only of a 3-dimensional array. In this case istride is size[1], and idist is size[1]*size[2].

Generalizing this, it seems like we could support strided ffts so long as the dimensions are sequential (no skips).
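
In Julia-flavored pseudocode, the parameters for the layout proposed above would be (illustrative values only, not wrapper code):

# Advanced-layout parameters for an FFT along dimension 2 of a 3D array
# with column-major sizes sz (mirroring the example above):
sz = (512, 512, 512)
n       = sz[2]          # transform length (the dimension being transformed)
istride = sz[1]          # distance between successive elements of one signal
idist   = sz[1] * sz[2]  # distance between first elements of consecutive signals
batch   = sz[3]          # number of signals in the batch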

btmit avatar Dec 18 '20 15:12 btmit

My understanding (if this behaves similarly to FFTW) is that this would only do FFTs along the 2nd dimension in the plane corresponding to index 1 of the 1st dimension (the istride here skips over the other elements along the 1st dimension, and idist essentially loops over indices in the 3rd dimension). To apply an FFT along the 2nd dimension of the full array, you would have to do an outer loop applying this plan at each index of the 1st dimension (and if the plan can be reused, as it can on the CPU, this may be faster than doing a permutation).

On the FFTW side, this is why the guru interface exists -- to do multidimensional loops of transforms. But from https://docs.nvidia.com/cuda/cufft/index.html#fftw-supported-interface, it seems that this is not supported in cuFFT (the entry for "Guru Complex DFTs" lists multidimensional transform batches as "Unsupported").

kburns avatar Dec 18 '20 16:12 kburns

Here is some code, which can be used to transform along a single (internal) dimension:

using CUDA, FFTW

function fft!(arr::CuArray, d::Int)
    if d > 1 && d < ndims(arr)
        # Colons for the first d dimensions; index 1 for the trailing ones:
        front_ids = ntuple(dd -> Colon(), d)
        ids = ntuple(dd -> ifelse(dd <= d, Colon(), 1), ndims(arr))
        # Plan an in-place FFT along dim d of one contiguous leading slice ...
        p = plan_fft!((@view arr[ids...]), d)
        # ... and apply it to every slice over the trailing dimensions.
        for c in CartesianIndices(size(arr)[d+1:end])
            p * @view arr[front_ids..., Tuple(c)...]
        end
    else
        # First or last dimension: CUFFT handles this directly.
        CUDA.CUFFT.fft!(arr, d)
    end
end

It is not ideal, but at least it does not do any extra GPU allocations. On my computer it is about 2x slower than an FFT along the outer dimensions. Ideally the loop over the Cartesian indices should be replaced by a call to cufftPlanMany() (or the PencilFFTs mentioned above?). This code can also be extended to work with multiple inner dimensions given as a tuple, by simply transforming these dimensions sequentially, as sketched below.
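
For example (a hypothetical extension; it reuses the single-dimension fft! above and relies on the multidimensional DFT factoring into sequential 1D transforms):

# Hypothetical extension: transform several dimensions by applying the
# single-dimension fft! above once per dimension.
function fft!(arr::CuArray, dims::NTuple{N,Int}) where {N}
    for d in dims
        fft!(arr, d)
    end
    return arr
end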

RainerHeintzmann avatar May 03 '23 10:05 RainerHeintzmann

... turns out it is even slightly worse. FFTing data with more than 3 dims crashes:

julia> fca = fft(CuArray(rand(4,4,4,4)));
ERROR: CUFFTError: user specified an invalid pointer or parameter (code 4, CUFFT_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:11
  [2] macro expansion
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:24 [inlined]
  [3] cufftMakePlanMany(plan::Int32, rank::Int64, n::Vector{Int32}, inembed::Ptr{Nothing}, istride::Int64, idist::Int64, onembed::Ptr{Nothing}, ostride::Int64, odist::Int64, type::CUDA.CUFFT.cufftType_t, batch::Int64, workSize::Base.RefValue{UInt64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\call.jl:26
  [4] cufftMakePlan(xtype::CUDA.CUFFT.cufftType_t, xdims::NTuple{4, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:37
  [5] #133
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:145 [inlined]
  [6] (::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})()
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:28
  [7] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, l::ReentrantLock)
    @ Base .\lock.jl:185
  [8] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})(f::CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:26
  [9] pop!(f::Function, cache::CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, key::Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:47
 [10] cufftGetPlan(::CUDA.CUFFT.cufftType_t, ::Vararg{Any})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:143
 [11] plan_fft(X::CuArray{ComplexF64, 4, CUDA.Mem.DeviceBuffer}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:163
 [12] fft
    @ C:\Users\pi96doc\.julia\packages\AbstractFFTs\0uOAT\src\definitions.jl:63 [inlined]
 [13] fft (repeats 2 times)
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:124 [inlined]
 [14] top-level scope
    @ REPL[12]:1
 [15] top-level scope
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\src\initialization.jl:171

Might try to give it a shot to fix these things...

RainerHeintzmann avatar May 09 '23 12:05 RainerHeintzmann

this should be mostly solved now by this pull request: https://github.com/JuliaGPU/CUDA.jl/pull/1903

RainerHeintzmann avatar May 12 '23 11:05 RainerHeintzmann

@ali-ramadhan @whymusticode what is the consensus here? Is permutedims + fft the way to go?

BTW, I would be grateful if somebody could provide a CUDA-compatible DCT implementation. :D

vpuri3 avatar Jul 10 '23 20:07 vpuri3

this should be mostly solved now by this pull request: #1903

I still receive the same error when attempting a 4D fft (even on the latest CUDA.jl release). Did your comment mean that the 4D fft should be working or just that some machinery is now in place to make this possible? I would be happy to help contribute to getting this functioning if someone can provide some initial guidance on what is needed. Thanks!

ancorso avatar Mar 26 '24 02:03 ancorso

Thanks for bringing this up. I will try to look into this.

RainerHeintzmann avatar Mar 26 '24 13:03 RainerHeintzmann

Yes, currently only a total of up to 3 transform directions is supported. E.g. fca = fft(CuArray(rand(4,4,4,4,4)), (1,3,4)); transforming directions 1, 3, and 4 of the 5D array does work. Do you have a use case that needs a 4D FFT? I guess one could try to support it by sequentially calling transforms for more trailing directions (e.g. in-place on the result, as sketched below), but it would presumably mean more programming work. @maleadt, any opinion on this? Should the work be invested? Apparently there is also a new FFT library, https://github.com/PaulVirally/VkFFTCUDA.jl, which possibly may want to be supported?
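
For instance, a full 4D FFT could presumably be composed from two supported transforms (a sketch under that assumption; each call stays within CUFFT's 3-dimension limit):

using CUDA, FFTW

# Sketch: compose a full 4D FFT from two supported calls. Dims (1, 2, 3)
# form a leading batched 3D transform, and dim 4 alone is a trailing
# batched dimension.
A = CuArray(rand(ComplexF32, 4, 4, 4, 4))
fft!(A, (1, 2, 3))  # 3D transform, batched over dim 4
fft!(A, 4)          # 1D transform along dim 4, batched over dims 1-3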

RainerHeintzmann avatar Apr 01 '24 16:04 RainerHeintzmann

In support of investing the work (which I'm happy to help with if pointed in the right direction), I am using 4D FFTs in conjunction with the Fourier Neural Operator (https://github.com/SciML/NeuralOperators.jl) to model 4D PDE solutions (3 spatial dimensions, and time). This is a feature supported in PyTorch (https://pytorch.org/docs/stable/generated/torch.fft.fftn.html#torch.fft.fftn), so I think supporting an n-dimensional FFT would help improve feature parity between Julia and Python ML toolkits.

ancorso avatar Apr 01 '24 17:04 ancorso

@maleadt, any opinion on this?

Not really, sorry. I'm myself not a user of the CUDA FFT functionality.

maleadt avatar Apr 05 '24 11:04 maleadt