Audit uses of 32-bit indexing
We're currently using Int32 indices in some kernels, via the i32 hack, because that often results in significantly better performance. However, GPUs are getting larger, and users are starting to use arrays whose lengths exceed typemax(Int32). This can result in bugs like https://github.com/JuliaGPU/CUDA.jl/issues/1963
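For context, the hack computes kernel indices in Int32 arithmetic, which saves registers on the device but overflows once an array has more than typemax(Int32) elements. A minimal sketch of the pattern (memset32_kernel! is a hypothetical example, not an actual CUDA.jl kernel; it assumes the device intrinsics return Int32, as they do on recent CUDA.jl versions):

using CUDA

function memset32_kernel!(A, x)
    # blockIdx/blockDim/threadIdx return Int32, and subtracting Int32(1)
    # (or CUDA.jl's 1i32 shorthand) keeps the whole computation in 32 bits
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    # once i exceeds typemax(Int32) it wraps to a negative value, which
    # still passes this guard and then indexes out of bounds under @inbounds
    if i <= length(A)
        @inbounds A[i] = x
    end
    return nothing
end

A = CUDA.fill(1f0, 2^32)
@cuda threads=256 blocks=cld(length(A), 256) memset32_kernel!(A, 2f0)
# for the upper half of A, i has overflowed: elements are skipped or
# written out of bounds instead of being set to 2f0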
We should be more careful about using 32-bit indexing, and probably avoid i32 until we have a better way of deciding which index type to use. Maybe we can add some kind of index_type trait, defaulting to Int but using Int32 when the input arrays allow it, e.g., building on https://github.com/JuliaGPU/CUDA.jl/pull/1895.
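A rough sketch of what that could look like (the index_type function and the I-parameterized kernel below are hypothetical illustrations, not an existing CUDA.jl API):

using CUDA

# default to Int, but allow Int32 when every input is small enough
index_type(As...) =
    all(A -> length(A) <= typemax(Int32), As) ? Int32 : Int

function memset_kernel!(::Type{I}, A, x) where {I<:Integer}
    # all index arithmetic happens in I, chosen at launch time
    i = (I(blockIdx().x) - one(I)) * I(blockDim().x) + I(threadIdx().x)
    if i <= length(A)
        @inbounds A[i] = x
    end
    return nothing
end

A = CUDA.fill(1f0, 2^32)
IndexT = index_type(A)  # Int, since length(A) > typemax(Int32)
@cuda threads=256 blocks=cld(length(A), 256) memset_kernel!(IndexT, A, 2f0)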
Dear CUDA.jl team, I would like to bump this issue. The last couple of generations of GPUs (e.g. the L40S, H100, and H200) have enough memory to handle arrays with more than 2 billion elements.
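For scale, a single 2^32-element Float32 array is 16 GiB:

julia> Base.format_bytes(2^32 * sizeof(Float32))
"16.000 GiB"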
Error 1 (broadcasting)
julia> using CUDA
julia> A = CUDA.fill(1f0, 2^32); A .= 2f0
ERROR: InexactError: trunc(Int32, 4294967296)
Stacktrace:
[1] throw_inexacterror(::Symbol, ::Vararg{Any})
@ Core ./boot.jl:750
[2] checked_trunc_sint
@ ./boot.jl:764 [inlined]
[3] toInt32
@ ./boot.jl:801 [inlined]
[4] Int32
@ ./boot.jl:891 [inlined]
[5] convert
@ ./number.jl:7 [inlined]
[6] cconvert
@ ./essentials.jl:687 [inlined]
[7] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:222 [inlined]
[8] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:5139 [inlined]
[9] #735
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:35 [inlined]
[10] check
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:35 [inlined]
[11] cuOccupancyMaxPotentialBlockSize
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
[12] launch_configuration(fun::CuFunction; shmem::Int64, max_threads::Int64)
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/occupancy.jl:61
[13] launch_configuration
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/occupancy.jl:56 [inlined]
[14] (::KernelAbstractions.Kernel{…})(::CuArray{…}, ::Vararg{…}; ndrange::Tuple{…}, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/1kIOw/src/CUDAKernels.jl:107
Error 2 (filling a large array, no explicit broadcasting)
julia> A = CUDA.fill(true, 2^32);
ERROR: InexactError: trunc(Int32, 4294967296)
Stacktrace:
[1] throw_inexacterror(::Symbol, ::Vararg{Any})
@ Core ./boot.jl:750
[2] checked_trunc_sint
@ ./boot.jl:764 [inlined]
[3] toInt32
@ ./boot.jl:801 [inlined]
[4] Int32
@ ./boot.jl:891 [inlined]
[5] convert
@ ./number.jl:7 [inlined]
[6] cconvert
@ ./essentials.jl:687 [inlined]
[7] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:222 [inlined]
[8] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:5139 [inlined]
[9] #735
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:35 [inlined]
[10] check
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:35 [inlined]
[11] cuOccupancyMaxPotentialBlockSize
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
[12] launch_configuration(fun::CuFunction; shmem::Int64, max_threads::Int64)
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/occupancy.jl:61
[13] launch_configuration
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/occupancy.jl:56 [inlined]
[14] (::KernelAbstractions.Kernel{…})(::CuArray{…}, ::Vararg{…}; ndrange::Tuple{…}, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/1kIOw/src/CUDAKernels.jl:107
[15] fill!(A::CuArray{Bool, 1, CUDA.DeviceMemory}, x::Bool)
@ GPUArrays ~/.julia/packages/GPUArrays/uiVyU/src/host/construction.jl:22
[16] fill
@ ~/.julia/packages/CUDA/1kIOw/src/array.jl:777 [inlined]
[17] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/src/utilities.jl:35 [inlined]
[18] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/src/memory.jl:831 [inlined]
[19] top-level scope
@ ./REPL[114]:1
Some type information was truncated. Use `show(err)` to see complete types.
EDIT: I believe this was fixed a couple of days ago; I'll wait for the next release and re-run my code.
As you noted, those errors are unrelated to this issue, and they are fixed on the master branch.