
CUFFT: Support Float16

eschnett opened this issue 1 year ago

To support Float16 and other, more generic scenarios, cuFFT plans need to be created via cufftXtMakePlanMany. I think this function provides a superset of all the other plan-generating functions.

The main difference from the previous APIs is that the transform is no longer described by a cufftType constant; instead, it is described directly by its input and output data types.
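
For concreteness, here is a rough sketch of what a 1-D half-precision real-to-complex plan could look like at that level. This is illustrative only: it assumes the autogenerated wrappers mirror the C signatures of cufftCreate/cufftXtMakePlanMany, that CUDA.jl's cudaDataType conversion covers Float16, and the helper name half_rfft_plan is made up.

using CUDA
using CUDA.CUFFT: cufftHandle, cufftCreate, cufftXtMakePlanMany

# Sketch, not actual PR code: create a rank-1 Float16 -> Complex{Float16} plan.
function half_rfft_plan(n::Integer)
    handle = Ref{cufftHandle}()
    cufftCreate(handle)
    sz = Clonglong[n]                                        # transform size (rank 1)
    worksize = Ref{Csize_t}(0)
    intype  = convert(CUDA.cudaDataType, Float16)            # CUDA_R_16F
    outtype = convert(CUDA.cudaDataType, Complex{Float16})   # CUDA_C_16F
    # With inembed/onembed left as C_NULL, cuFFT ignores the stride/dist
    # arguments and assumes a default contiguous layout. The last argument is
    # the execution type, which cuFFT requires to be complex half precision
    # for half-precision transforms.
    cufftXtMakePlanMany(handle[], 1, sz,
                        C_NULL, 1, 1, intype,
                        C_NULL, 1, 1, outtype,
                        1, worksize, outtype)
    return handle[]
end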

eschnett avatar Jul 01 '24 17:07 eschnett

Are the libcufft.jl changes autogenerated? I don't see any changes to the wrapper generators.

maleadt avatar Jul 03 '24 17:07 maleadt

Ah, rats. That's what the "This file is automatically generated. Do not edit!" comment refers to!

eschnett avatar Jul 03 '24 20:07 eschnett

Yep. But it's probably as easy as adding cufftXt.h to the list of headers that are parsed for libcufft.jl (https://github.com/JuliaGPU/CUDA.jl/blob/a90cba132c3da588e0c70955525e5d1d3f2a4c81/res/wrap/wrap.jl#L288-L290) and correcting the argument types (Ptr -> CuPtr wherever an argument refers to device memory) in the database, https://github.com/JuliaGPU/CUDA.jl/blob/master/res/wrap/cufft.toml; see the sketch below.
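
To illustrate the second part, an override in that database would look roughly like the following. The section/key layout here is only a sketch and should be checked against the existing entries in cufft.toml, but cufftXtExec does take its input and output device buffers as the second and third arguments:

[api.cufftXtExec.argtypes]
2 = "CuPtr{Cvoid}"
3 = "CuPtr{Cvoid}"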

maleadt avatar Jul 04 '24 06:07 maleadt

Thanks for the feedback and the pointers so far.

Current state: I'm trying to track down allocation errors like the following:

      From worker 2:    WARNING: Error while freeing DeviceMemory(128.000 KiB at 0x000000193e084800):
      From worker 2:    CUDA.CuError(code=CUDA.cudaError_enum(0x000002cf))
      From worker 2:
      From worker 2:    Stacktrace:
      From worker 2:      [1] throw_api_error(res::CUDA.cudaError_enum)
      From worker 2:        @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:30
      From worker 2:      [2] check
      From worker 2:        @ ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:37 [inlined]
      From worker 2:      [3] cuMemFreeAsync
      From worker 2:        @ ~/.julia/dev/CUDA/lib/utils/call.jl:34 [inlined]

eschnett avatar Jul 06 '24 21:07 eschnett

This is ready for a review.

I understand that the changes are larger than expected. I essentially removed all (internal) support for the previous cufftType, which required explicit code paths for every input precision (single/double) and transform kind (c2c, c2r, r2c). I switched to the new Xt interface, which unifies this: a transform is now specified purely by its input and output types.

This required changing how the plans for the AbstractFFTs interface are represented. There is now a single plan type, CuFFTPlan, parameterized by the output element type T and the input element type S. Overall, the code has become simpler.
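
As a usage-level sketch of what this should enable (hedged: this assumes the PR lands as described, and the element-type coverage still follows cuFFT's own rules, e.g. half-precision transforms being limited to power-of-two sizes):

using CUDA
using CUDA.CUFFT
using AbstractFFTs

x = CuArray(rand(Float16, 1024))   # real half-precision input, power-of-two length
p = plan_rfft(x)                   # plan characterized by input type Float16 and output type Complex{Float16}
y = p * x                          # expected to yield a CuArray{Complex{Float16}}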

eschnett avatar Jul 09 '24 17:07 eschnett

CI fails because it times out before it reaches the CUFFT tests.

eschnett avatar Jul 09 '24 19:07 eschnett

CI fails because it times out before it reaches the CUFFT tests.

That seems suspicious... CI doesn't print which tests it's working on, so given that the master branch works fine, I'd suspect the hang is in cuFFT.

maleadt avatar Jul 10 '24 12:07 maleadt

Hangs locally too:

signal (10): User defined signal 1
unknown function (ip: 0x7e2b226e7489)
__pthread_rwlock_wrlock at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e2af4aa9903)
unknown function (ip: 0x7e2af4727e93)
unknown function (ip: 0x7e2af48602b8)
macro expansion at /home/tim/Julia/pkg/CUDA/lib/utils/call.jl:218 [inlined]
unchecked_cuModuleLoadDataEx at /home/tim/Julia/pkg/CUDA/lib/cudadrv/libcuda.jl:3445 [inlined]
#952 at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:25
retry_reclaim at /home/tim/Julia/pkg/CUDA/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:24
CuModule at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:60
CuModule at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:49 [inlined]
link at /home/tim/Julia/pkg/CUDA/src/compiler/compilation.jl:413
jfptr_link_14432 at /home/tim/.julia/compiled/v1.10/CUDA/oWw5k_kRtLK.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
actual_compilation at /home/tim/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:134
unknown function (ip: 0x7e2b07ff4a39)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
cached_compilation at /home/tim/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
macro expansion at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:369 [inlined]
macro expansion at ./lock.jl:267 [inlined]
#cufunction#1171 at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:364
cufunction at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:361 [inlined]
macro expansion at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:112 [inlined]
#launch_heuristic#1204 at /home/tim/Julia/pkg/CUDA/src/gpuarrays.jl:17 [inlined]
launch_heuristic at /home/tim/Julia/pkg/CUDA/src/gpuarrays.jl:15 [inlined]
_copyto! at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:78 [inlined]
copyto! at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:44 [inlined]
copy at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:29 [inlined]
materialize at ./broadcast.jl:903 [inlined]
broadcast at ./broadcast.jl:841
copy1 at /home/tim/Julia/pkg/CUDA/lib/cufft/util.jl:22
realfloat at /home/tim/Julia/pkg/CUDA/lib/cufft/util.jl:17 [inlined]
plan_rfft at /home/tim/Julia/pkg/CUDA/lib/cufft/fft.jl:123 [inlined]
#plan_rfft#7 at /home/tim/.julia/packages/AbstractFFTs/4iQz5/src/definitions.jl:68 [inlined]
plan_rfft at /home/tim/.julia/packages/AbstractFFTs/4iQz5/src/definitions.jl:68 [inlined]
out_of_place at /home/tim/Julia/pkg/CUDA/test/libraries/cufft.jl:385
unknown function (ip: 0x7e2b07b7f595)

maleadt avatar Jul 10 '24 12:07 maleadt

The culprit seems to be a stack overflow:

  LoadError: StackOverflowError:
  Stacktrace:
    [1] launch(::CuFunction, ::CUDA.KernelState, ::CuDeviceVector{ComplexF64, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{ComplexF64, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CuStream)
      @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:73
    [2] launch
      @ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:52 [inlined]
    [3] #972
      @ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:189 [inlined]
    [4] macro expansion
      @ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:149 [inlined]
    [5] macro expansion
      @ ./none:0 [inlined]
    [6] convert_arguments
      @ ./none:0 [inlined]
    [7] #cudacall#971
      @ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:191 [inlined]
    [8] cudacall
      @ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:187 [inlined]
    [9] macro expansion
      @ ~/Julia/pkg/CUDA/src/compiler/execution.jl:268 [inlined]
   [10] macro expansion
      @ ./none:0 [inlined]
   [11] call
      @ ./none:0 [inlined]
   [12] (::CUDA.HostKernel{GPUArrays.var"#34#36", Tuple{CUDA.CuKernelContext, CuDeviceVector{ComplexF64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{ComplexF64, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})(::CUDA.CuKernelContext, ::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuArray{ComplexF64, 1, CUDA.DeviceMemory}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; threads::Int64, blocks::Int64, kwargs::@Kwargs{})
      @ CUDA ~/Julia/pkg/CUDA/src/compiler/execution.jl:390
   [13] HostKernel
      @ ~/Julia/pkg/CUDA/src/compiler/execution.jl:389 [inlined]
   [14] macro expansion
      @ ~/Julia/pkg/CUDA/src/compiler/execution.jl:114 [inlined]
   [15] #gpu_call#1205
      @ ~/Julia/pkg/CUDA/src/gpuarrays.jl:30 [inlined]
   [16] gpu_call
      @ ~/Julia/pkg/CUDA/src/gpuarrays.jl:28 [inlined]
   [17] gpu_call(::GPUArrays.var"#34#36", ::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuArray{ComplexF64, 1, CUDA.DeviceMemory}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; target::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, elements::Nothing, threads::Int64, blocks::Int64, name::Nothing)
      @ GPUArrays ~/.julia/packages/GPUArrays/8Y80U/src/device/execution.jl:69
   [18] gpu_call
      @ ~/.julia/packages/GPUArrays/8Y80U/src/device/execution.jl:34 [inlined]
   [19] _copyto!
      @ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:82 [inlined]
   [20] copyto!
      @ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:44 [inlined]
   [21] copy
      @ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:29 [inlined]
   [22] materialize
      @ ./broadcast.jl:903 [inlined]
   [23] broadcast(f::CUDA.CUFFT.var"#112#113"{ComplexF64}, As::CuArray{ComplexF64, 1, CUDA.DeviceMemory})
      @ Base.Broadcast ./broadcast.jl:841
   [24] copy1(::Type{ComplexF64}, x::CuArray{ComplexF64, 1, CUDA.DeviceMemory})
      @ CUDA.CUFFT ~/Julia/pkg/CUDA/lib/cufft/util.jl:22
   [25] *(p::CUDA.CUFFT.CuFFTPlan{ComplexF64, Float64, -1, false, 1}, x::CuArray{ComplexF64, 1, CUDA.DeviceMemory}) (repeats 79964 times)
      @ CUDA.CUFFT ~/Julia/pkg/CUDA/lib/cufft/fft.jl:11
  in expression starting at /home/tim/Julia/pkg/CUDA/test/libraries/cufft.jl:40

MWE:

using CUDA
using CUDA.CUFFT

import FFTW
using AbstractFFTs

X = rand(Int, 8)     # integer input, so the plan has to convert it to its real input type
fftw_X = rfft(X)     # CPU reference via FFTW
d_X = CuArray(X)
p = plan_rfft(d_X)   # real-to-complex plan: input and output element types differ
d_Y = p * d_X        # triggers the stack overflow above
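
As a sanity check for the eventual fix, the GPU result should agree with the FFTW reference, since both paths promote the integer input to Float64:

@assert Array(d_Y) ≈ fftw_X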

maleadt avatar Jul 10 '24 13:07 maleadt

Thanks. The problem was that the input array was converted to the output type of the plan instead of the expected input type. (Those differ for real-to-complex transforms.) Because the converted array then still didn't have the plan's expected input type, the fallback * method kept converting and calling itself, which is the runaway recursion visible in the trace.

eschnett avatar Jul 10 '24 15:07 eschnett

AFAIU, the Xt libraries are intended for multi-GPU workloads (and sometimes pose their own problems because of that, see e.g. https://github.com/JuliaGPU/CUDA.jl/issues/2320). The documentation doesn't seem to explicitly state that using cufftXt without cufftXtSetGPUs is otherwise equivalent to the single-GPU API. Or did I miss something?

maleadt avatar Jul 12 '24 10:07 maleadt

@maleadt I scoured the documentation for cuFFT (https://docs.nvidia.com/cuda/cufft/index.html). I find these statements:

  • (2.7) "Every cuFFT plan may be associated with a CUDA stream."
  • (2.8.1) "In the single GPU case a plan is created by a call to cufftCreate() followed by a call to cufftMakePlan*(). For multiple GPUs, the GPUs to use for execution are identified by a call to cufftXtSetGPUs() and this must occur after the call to cufftCreate() and prior to the call to cufftMakePlan*()."

That is, it seems that multi-GPU execution is possible with both cufftMakePlan*() and cufftXtMakePlan*(), and there is no inherent connection between using multiple GPUs and using the Xt functions in cuFFT.

The problem to which you point seems to be related to memory allocation. I don't change this in my PR – memory continues to be allocated by creating CuArray objects.

eschnett avatar Jul 12 '24 13:07 eschnett

The problem to which you point seems to be related to memory allocation. I don't change this in my PR – memory continues to be allocated by creating CuArray objects.

That doesn't matter: the problem is that our memory allocations are asynchronous (stream ordered), and cublasXt accesses memory in a way that complicates the use of such memory. I'm wary that cufftXt has the same issues, necessitating manual synchronization as noted in https://github.com/JuliaGPU/CUDA.jl/issues/2320#issuecomment-2133238480. It may be that this only manifests when actually using multiple GPUs, but I'd be more comfortable if there were a guarantee that single-GPU cufftXt calls are stream ordered.
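
Concretely, the workaround from that issue amounts to something like this (a sketch; whether it is needed for single-GPU cufftXt is exactly the open question):

using CUDA

A = CuArray(rand(Float32, 1024))   # allocated and populated asynchronously on the task-local stream
CUDA.synchronize()                 # manual synchronization, only needed if the library call is not stream ordered
# ... pass pointer(A) to the non-stream-ordered library call here ...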

I'll ping NVIDIA about this.

maleadt avatar Jul 12 '24 14:07 maleadt

Alright, it was confirmed to me that cuFFT, be it through the Xt APIs or the regular ones, orders operations identically against the stream set by the user. So this shouldn't pose any problem.

maleadt avatar Jul 18 '24 10:07 maleadt