CUFFT: Support Float16
To support Float16 and other, more generic scenarios, cuFFT plans need to be created via `cufftXtMakePlanMany`. I think this function provides a superset of all other plan-generating functions.
The main difference from the previous APIs is that a transform is no longer described via a `cufftType`, but instead directly via its input and output types.
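For illustration, here is roughly how the two styles compare for a length-8 real-to-complex Float16 transform, written against the raw wrappers. This is a sketch only; the `CUDA_R_16F`/`CUDA_C_16F` constant names follow the CUDA headers, and the exact Julia spellings are an assumption:

```julia
# Sketch: plan a 1D Float16 r2c transform via the Xt interface.
handle = Ref{cufftHandle}()
cufftCreate(handle)
worksize = Ref{Csize_t}()

# Old style: the transform is encoded in a cufftType (CUFFT_R2C, CUFFT_Z2Z, ...),
# which only covers single and double precision:
#   cufftMakePlanMany(handle[], 1, Cint[8], C_NULL, 1, 8,
#                     C_NULL, 1, 5, CUFFT_R2C, 1, worksize)

# New style: input, output, and execution types are passed explicitly as
# cudaDataType values, which makes half precision expressible:
cufftXtMakePlanMany(handle[], 1, Clonglong[8],
                    C_NULL, 1, 8, CUDA_R_16F,  # input: real Float16
                    C_NULL, 1, 5, CUDA_C_16F,  # output: 8÷2+1 complex values
                    1, worksize, CUDA_C_16F)   # batch, work size, execution type
```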
Are the `libcufft.jl` changes autogenerated? I don't see any changes to the wrapper generators.
Ah, rats. That's what the `This file is automatically generated. Do not edit!` comment refers to!
Yep. But it's probably as easy as adding `cufftXt.h` to the list of headers that are parsed for `libcufft.jl` (https://github.com/JuliaGPU/CUDA.jl/blob/a90cba132c3da588e0c70955525e5d1d3f2a4c81/res/wrap/wrap.jl#L288-L290), and correcting argument types (`Ptr` -> `CuPtr` where they refer to device memory) in the database, https://github.com/JuliaGPU/CUDA.jl/blob/master/res/wrap/cufft.toml.
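As a hypothetical example of the kind of correction needed, take `cufftXtExec`, whose input and output buffers live in device memory. After rewriting its pointer types in the database, the generated wrapper would come out roughly as (sketch in the style of the generated `libcufft.jl`; the actual generated code may differ):

```julia
# The generator emits Ptr for all pointer arguments by default; the .toml
# database rewrites the device-memory ones to CuPtr.
@checked function cufftXtExec(plan, input, output, direction)
    ccall((:cufftXtExec, libcufft), cufftResult,
          (cufftHandle, CuPtr{Cvoid}, CuPtr{Cvoid}, Cint),  # Ptr -> CuPtr applied
          plan, input, output, direction)
end
```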
Thanks for the feedback and the pointers so far.
Current state: Trying to track down allocation errors
From worker 2: WARNING: Error while freeing DeviceMemory(128.000 KiB at 0x000000193e084800):
From worker 2: CUDA.CuError(code=CUDA.cudaError_enum(0x000002cf))
From worker 2:
From worker 2: Stacktrace:
From worker 2: [1] throw_api_error(res::CUDA.cudaError_enum)
From worker 2: @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:30
From worker 2: [2] check
From worker 2: @ ~/.julia/dev/CUDA/lib/cudadrv/libcuda.jl:37 [inlined]
From worker 2: [3] cuMemFreeAsync
From worker 2: @ ~/.julia/dev/CUDA/lib/utils/call.jl:34 [inlined]
This is ready for review.
I understand that the changes are larger than expected. I essentially removed all (internal) support for the previous `cufftType`, which required explicit code paths for every input type (single/double precision) and transform type (c2c, c2r, r2c), and switched to the new Xt interface, which unifies these: a transform is now specified purely by its input and output types.
This required changing how the plans for the AbstractFFTs interface are represented. There is now a single plan type, `CuFFTPlan`, whose leading type parameters `T` (output) and `S` (input) describe the transform. Overall, the code has become simpler.
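For reference, a minimal sketch of what that representation looks like. Field names and the trailing parameters are assumptions; the concrete instance `CuFFTPlan{ComplexF64, Float64, -1, false, 1}` can be seen in the stack trace further down:

```julia
using AbstractFFTs
const cufftHandle = Cint  # stand-in for the generated handle type

# T = output eltype, S = input eltype, K = direction, inplace, N = dimensionality
mutable struct CuFFTPlan{T<:Number,S<:Number,K,inplace,N} <: AbstractFFTs.Plan{S}
    handle::cufftHandle         # underlying cuFFT plan
    input_size::NTuple{N,Int}   # extents of the S-typed input
    output_size::NTuple{N,Int}  # extents of the T-typed output
end
```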
CI fails because it times out before it reaches the CUFFT tests.
> CI fails because it times out before it reaches the CUFFT tests.
That seems suspicious... CI doesn't print which tests it's running, but given that the master branch works fine, I'd suspect the hang is in cuFFT.
Hangs locally too:
signal (10): User defined signal 1
unknown function (ip: 0x7e2b226e7489)
__pthread_rwlock_wrlock at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e2af4aa9903)
unknown function (ip: 0x7e2af4727e93)
unknown function (ip: 0x7e2af48602b8)
macro expansion at /home/tim/Julia/pkg/CUDA/lib/utils/call.jl:218 [inlined]
unchecked_cuModuleLoadDataEx at /home/tim/Julia/pkg/CUDA/lib/cudadrv/libcuda.jl:3445 [inlined]
#952 at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:25
retry_reclaim at /home/tim/Julia/pkg/CUDA/src/memory.jl:434 [inlined]
checked_cuModuleLoadDataEx at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:24
CuModule at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:60
CuModule at /home/tim/Julia/pkg/CUDA/lib/cudadrv/module.jl:49 [inlined]
link at /home/tim/Julia/pkg/CUDA/src/compiler/compilation.jl:413
jfptr_link_14432 at /home/tim/.julia/compiled/v1.10/CUDA/oWw5k_kRtLK.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
actual_compilation at /home/tim/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:134
unknown function (ip: 0x7e2b07ff4a39)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
cached_compilation at /home/tim/.julia/packages/GPUCompiler/nWT2N/src/execution.jl:103
macro expansion at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:369 [inlined]
macro expansion at ./lock.jl:267 [inlined]
#cufunction#1171 at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:364
cufunction at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:361 [inlined]
macro expansion at /home/tim/Julia/pkg/CUDA/src/compiler/execution.jl:112 [inlined]
#launch_heuristic#1204 at /home/tim/Julia/pkg/CUDA/src/gpuarrays.jl:17 [inlined]
launch_heuristic at /home/tim/Julia/pkg/CUDA/src/gpuarrays.jl:15 [inlined]
_copyto! at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:78 [inlined]
copyto! at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:44 [inlined]
copy at /home/tim/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:29 [inlined]
materialize at ./broadcast.jl:903 [inlined]
broadcast at ./broadcast.jl:841
copy1 at /home/tim/Julia/pkg/CUDA/lib/cufft/util.jl:22
realfloat at /home/tim/Julia/pkg/CUDA/lib/cufft/util.jl:17 [inlined]
plan_rfft at /home/tim/Julia/pkg/CUDA/lib/cufft/fft.jl:123 [inlined]
#plan_rfft#7 at /home/tim/.julia/packages/AbstractFFTs/4iQz5/src/definitions.jl:68 [inlined]
plan_rfft at /home/tim/.julia/packages/AbstractFFTs/4iQz5/src/definitions.jl:68 [inlined]
out_of_place at /home/tim/Julia/pkg/CUDA/test/libraries/cufft.jl:385
unknown function (ip: 0x7e2b07b7f595)
The culprit seems to be a stack overflow in a very short recursion cycle:
LoadError: StackOverflowError:
Stacktrace:
[1] launch(::CuFunction, ::CUDA.KernelState, ::CuDeviceVector{ComplexF64, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{ComplexF64, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CuStream)
@ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:73
[2] launch
@ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:52 [inlined]
[3] #972
@ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:189 [inlined]
[4] macro expansion
@ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:149 [inlined]
[5] macro expansion
@ ./none:0 [inlined]
[6] convert_arguments
@ ./none:0 [inlined]
[7] #cudacall#971
@ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:191 [inlined]
[8] cudacall
@ ~/Julia/pkg/CUDA/lib/cudadrv/execution.jl:187 [inlined]
[9] macro expansion
@ ~/Julia/pkg/CUDA/src/compiler/execution.jl:268 [inlined]
[10] macro expansion
@ ./none:0 [inlined]
[11] call
@ ./none:0 [inlined]
[12] (::CUDA.HostKernel{GPUArrays.var"#34#36", Tuple{CUDA.CuKernelContext, CuDeviceVector{ComplexF64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{ComplexF64, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})(::CUDA.CuKernelContext, ::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuArray{ComplexF64, 1, CUDA.DeviceMemory}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; threads::Int64, blocks::Int64, kwargs::@Kwargs{})
@ CUDA ~/Julia/pkg/CUDA/src/compiler/execution.jl:390
[13] HostKernel
@ ~/Julia/pkg/CUDA/src/compiler/execution.jl:389 [inlined]
[14] macro expansion
@ ~/Julia/pkg/CUDA/src/compiler/execution.jl:114 [inlined]
[15] #gpu_call#1205
@ ~/Julia/pkg/CUDA/src/gpuarrays.jl:30 [inlined]
[16] gpu_call
@ ~/Julia/pkg/CUDA/src/gpuarrays.jl:28 [inlined]
[17] gpu_call(::GPUArrays.var"#34#36", ::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, Tuple{Base.OneTo{Int64}}, CUDA.CUFFT.var"#112#113"{ComplexF64}, Tuple{Base.Broadcast.Extruded{CuArray{ComplexF64, 1, CUDA.DeviceMemory}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64; target::CuArray{ComplexF64, 1, CUDA.DeviceMemory}, elements::Nothing, threads::Int64, blocks::Int64, name::Nothing)
@ GPUArrays ~/.julia/packages/GPUArrays/8Y80U/src/device/execution.jl:69
[18] gpu_call
@ ~/.julia/packages/GPUArrays/8Y80U/src/device/execution.jl:34 [inlined]
[19] _copyto!
@ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:82 [inlined]
[20] copyto!
@ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:44 [inlined]
[21] copy
@ ~/.julia/packages/GPUArrays/8Y80U/src/host/broadcast.jl:29 [inlined]
[22] materialize
@ ./broadcast.jl:903 [inlined]
[23] broadcast(f::CUDA.CUFFT.var"#112#113"{ComplexF64}, As::CuArray{ComplexF64, 1, CUDA.DeviceMemory})
@ Base.Broadcast ./broadcast.jl:841
[24] copy1(::Type{ComplexF64}, x::CuArray{ComplexF64, 1, CUDA.DeviceMemory})
@ CUDA.CUFFT ~/Julia/pkg/CUDA/lib/cufft/util.jl:22
[25] *(p::CUDA.CUFFT.CuFFTPlan{ComplexF64, Float64, -1, false, 1}, x::CuArray{ComplexF64, 1, CUDA.DeviceMemory}) (repeats 79964 times)
@ CUDA.CUFFT ~/Julia/pkg/CUDA/lib/cufft/fft.jl:11
in expression starting at /home/tim/Julia/pkg/CUDA/test/libraries/cufft.jl:40
MWE:
```julia
using CUDA
using CUDA.CUFFT
import FFTW
using AbstractFFTs

X = rand(Int, 8)
fftw_X = rfft(X)
d_X = CuArray(X)
p = plan_rfft(d_X)
d_Y = p * d_X
```
Thanks. The problem was that the input array was converted to the output type of the plan instead of the expected input type. (Those differ for real->complex transforms.)
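The failure mode is easy to reproduce in isolation. Here is a self-contained toy (not CUDA.jl's actual code; the helper names are invented) that mirrors the `*` recursion from the trace:

```julia
struct ToyPlan{T,S} end  # T = output eltype, S = expected input eltype

function apply(p::ToyPlan{T,S}, x::AbstractVector) where {T,S}
    eltype(x) == S && return x  # stand-in for running the actual transform
    apply(p, convert.(T, x))    # BUG: converts to the output type; should be S
end

p = ToyPlan{ComplexF64,Float64}()
# apply(p, rand(Int, 8))  # StackOverflowError: the input becomes ComplexF64,
#                         # which never equals Float64, so `apply` recurses forever
```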
AFAIU, the Xt libraries are intended for multi-GPU workloads (and sometimes pose their own problems because of that; see e.g. https://github.com/JuliaGPU/CUDA.jl/issues/2320). The documentation doesn't seem to explicitly state that using cufftXt without `cufftXtSetGPUs` is otherwise equivalent to the single-GPU API. Or did I miss anything?
@maleadt I scoured the cuFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) and found these statements:
- (2.7) "Every cuFFT plan may be associated with a CUDA stream."
- (2.8.1) "In the single GPU case a plan is created by a call to cufftCreate() followed by a call to cufftMakePlan*(). For multiple GPUs, the GPUs to use for execution are identified by a call to cufftXtSetGPUs() and this must occur after the call to cufftCreate() and prior to the call to cufftMakePlan*()."
That is, it seems that using multiple GPUs is possible with both `cufftMakePlan*` and `cufftXtMakePlan*`, and that there is no inherent relation between using multiple GPUs and using the Xt functions in cuFFT; see the call-order sketch below.
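Translated into a call sequence (wrapper names as generated in `libcufft.jl`; the multi-GPU line is the only difference, and single-GPU code simply omits it):

```julia
handle = Ref{cufftHandle}()
cufftCreate(handle)
# Multi-GPU only: must come after cufftCreate and before any cufftMakePlan* /
# cufftXtMakePlan* call (cuFFT docs, section 2.8.1).
cufftXtSetGPUs(handle[], 2, Cint[0, 1])
worksize = Vector{Csize_t}(undef, 2)  # one work-area size per GPU
cufftMakePlan1d(handle[], 1024, CUFFT_C2C, 1, worksize)
```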
The problem you point to seems to be related to memory allocation. I don't change this in my PR – memory continues to be allocated by creating `CuArray` objects.
> The problem you point to seems to be related to memory allocation. I don't change this in my PR – memory continues to be allocated by creating `CuArray` objects.
That doesn't matter: the problem is that our memory allocations are asynchronous, and cublasXt accesses memory in a way that complicates the use of asynchronously allocated memory. I'm wary that cufftXt has the same issues, necessitating manual synchronization as noted in https://github.com/JuliaGPU/CUDA.jl/issues/2320#issuecomment-2133238480. It might be that this only manifests when effectively using multiple GPUs, but I'd be more comfortable if there were a guarantee that single-GPU cufftXt calls are stream ordered.
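If that guarantee were absent, we'd presumably need the same workaround as for cublasXt: synchronizing before handing stream-ordered allocations to the library. A minimal sketch of that defensive pattern (unnecessary if cufftXt turns out to be stream ordered):

```julia
using CUDA

x = CUDA.rand(ComplexF32, 1024)  # stream-ordered (asynchronous) allocation
synchronize()                    # make it visible outside the allocating stream
# ... only now pass `x` to a cufftXt entry point ...
```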
I'll ping NVIDIA about this.
Alright, it was confirmed to me that cuFFT, be it through the Xt APIs or the regular ones, orders operations identically against the stream set by the user. So this shouldn't pose any problem.