
Dynamic parallelism

Open · simone-silvestri opened this issue 1 year ago · 1 comment

I am trying to set up a dynamic kernel in which a KA kernel launches a CUDA kernel. The final objective is to have dynamic parallelism using only KernelAbstractions. Below is an MWE comparing launching the parent kernel with CUDA versus with KA.

The child kernel:

using CUDA, KernelAbstractions

# Child kernel: each device thread writes its index into `a`
function child!(a)
    i = threadIdx().x
    @inbounds a[i] = i
    return nothing
end

CUDA implementation (this runs):

# Parent kernel: launches the child via a device-side (dynamic) launch
function parent!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
    return nothing
end

a = CuArray(zeros(10))

# Compile the parent without launching it; maxthreads sets its launch bounds
kernel! = @cuda launch=false maxthreads=10 always_inline=true parent!(a)

kernel!(a; threads=1, blocks=1)
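
As a quick host-side check (assuming the dynamic launch completed; the expected values follow from child! writing its thread index):

CUDA.synchronize()
Array(a)  # expected: 1.0, 2.0, …, 10.0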

KA implementation:

@kernel function parent!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
end

a = CuArray(zeros(10))

# Instantiate the KA kernel with a static workgroup size and ndrange of 1
kernel! = parent!(CUDA.CUDABackend(), 1, 1)

kernel!(a)

This returns:

JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3299 }) }
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_#_#14_3295 }) }
JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3306 }) }
ERROR: a CUDA error was thrown during kernel execution: invalid configuration argument (code 9, cudaErrorInvalidConfiguration)
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] throw_device_cuerror at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:20
 [2] #launch#950 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:27
 [3] launch at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:65
 [4] #868 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:136
 [5] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:95
 [6] macro expansion at ./none:0
 [7] convert_arguments at ./none:0
 [8] #cudacall#867 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:135
 [9] cudacall at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:134
 [10] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:219
 [11] macro expansion at ./none:0
 [12] #call#1045 at ./none:0
 [13] call at ./none:0
 [14] #_#1061 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [15] DeviceKernel at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [16] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:88
 [17] macro expansion at /home/ssilvest/test.jl:46
 [18] gpu_parent! at /home/ssilvest/.julia/packages/KernelAbstractions/WoCk1/src/macros.jl:90
 [19] gpu_parent! at ./none:0

Is this expected? I suspect it might be a problem of KA setting maxthreads=1 in the kernel call.

simone-silvestri · Dec 13 '23 15:12
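
One untested way to probe the maxthreads=1 hypothesis would be to instantiate the KA parent with a static workgroup size matching the child launch, so the compiled parent is not capped at one thread. A sketch only, using the parent! and a defined above; note that every parent thread would then perform its own (redundant but idempotent) child launch:

# Sketch: give KA a static workgroup size and ndrange of 10 instead of 1
kernel! = parent!(CUDA.CUDABackend(), 10, 10)
kernel!(a)  # each parent thread launches a 10-thread child grid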

Slightly confusing, so not expected.

In my experience dynamic parallelism doesn't have the best performance, and of course we would need to figure out what it means for at least one other backend.

vchuravy · Dec 13 '23 18:12
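
A portable stopgap, while device-side launches remain CUDA-only, is to flatten the nesting and launch the child from the host with plain KA. This is a sketch, not true dynamic parallelism, since the child's launch configuration must be known host-side; child_ka! is a hypothetical KA port of the child! kernel above:

using KernelAbstractions, CUDA

# Backend-agnostic child kernel: each work item writes its global index
@kernel function child_ka!(a)
    i = @index(Global)
    @inbounds a[i] = i
end

a = CuArray(zeros(10))
# Host-side "parent": launch the child directly with the desired size
child_ka!(CUDA.CUDABackend(), 10)(a; ndrange = 10)
KernelAbstractions.synchronize(CUDA.CUDABackend())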