KernelAbstractions.jl
Add GPU reverse mode to EnzymeExt
This is @michel2323's PR, but I'm opening it so we have a place to discuss.
wmoses@beast:~/git/Enzyme.jl/KernelAbstractions.jl (enz_rev_gpu) $ ../julia-1.10.0-rc2/bin/julia --project reverse_gpu.jl
Custom rule GPU
TapeType = @NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}
kernels: Error During Test at /home/wmoses/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:31
Got exception outside of a @test
GPU compilation of MethodInstance for EnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, true)}, ::Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, ::Duplicated{CuDeviceVector{Float64, 1}}) failed
KernelError: passing and using non-bitstype argument
Argument 5 to your kernel function is of type Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, which is not isbits:
Stacktrace:
[1] check_invocation(job::GPUCompiler.CompilerJob)
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:92
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:123 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:121
[5] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
[6] compile
@ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
[7] #1075
@ ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:247 [inlined]
[8] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
[9] compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:246
[10] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
[11] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
[12] macro expansion
@ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:382 [inlined]
[13] macro expansion
@ ./lock.jl:267 [inlined]
[14] cufunction(f::typeof(EnzymeExt.aug_fwd), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, typeof(gpu_square!), Val{(false, false, true)}, Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, Duplicated{CuDeviceVector{Float64, 1}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
@ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:377
[15] macro expansion
@ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:104 [inlined]
[16] (::KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(EnzymeExt.aug_fwd)})(::Function, ::Vararg{Any}; ndrange::Tuple{Int64}, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/packages/CUDA/YIj5X/src/CUDAKernels.jl:118
[17] #augmented_primal#12
@ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:163
[18] augmented_primal
@ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:115 [inlined]
[19] square_caller
@ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:14 [inlined]
[20] square_caller
@ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0 [inlined]
[21] diffejulia_square_caller_3883_inner_1wrap
@ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0
[22] macro expansion
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5306 [inlined]
[23] enzyme_call
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4984 [inlined]
[24] CombinedAdjointThunk
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4926 [inlined]
[25] autodiff
@ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:215 [inlined]
[26] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
@ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:238
[27] autodiff
@ ~/git/Enzyme.jl/src/Enzyme.jl:224 [inlined]
[28] macro expansion
@ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:40 [inlined]
[29] macro expansion
@ ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
[30] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
@ Main ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:32
[31] top-level scope
@ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:64
[32] include(mod::Module, _path::String)
@ Base ./Base.jl:495
[33] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:318
[34] _start()
@ Base ./client.jl:552
Test Summary: | Error  Total     Time
kernels       |     1      1  1m38.4s
@wsmoses Forward mode, which worked when I started on this, doesn't work anymore. I'm on the latest Enzyme#main. Forward-mode log attached: out_fwd.log
(KernelAbstractions) pkg> st
Project KernelAbstractions v0.9.15
Status `~/.julia/dev/KernelAbstractions/Project.toml`
[79e6a3ab] Adapt v4.0.1
[a9b6321e] Atomix v0.1.0
[052768ef] CUDA v5.2.0
[7da242da] Enzyme v0.11.13 `~/.julia/dev/Enzyme`
[1914dd2f] MacroTools v0.5.13
[aea7be01] PrecompileTools v1.2.0
[ae029012] Requires v1.3.0
[90137ffa] StaticArrays v1.9.1
[013be700] UnsafeAtomics v0.2.1
[d80eeb9a] UnsafeAtomicsLLVM v0.1.3
[7cc45869] Enzyme_jll v0.0.98+0 `../Enzyme_jll`
[b77e0a4c] InteractiveUtils
[37e2e46d] LinearAlgebra
[2f01184e] SparseArrays v1.10.0
[cf7118a7] UUIDs
I added the following allocate call:
subtape = allocate(CUDABackend(), TapeType, size(blocks(iterspace)))
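For context, the non-isbits KernelError above came from passing a host `Vector` of tapes as a kernel argument. Allocating the subtape through the KernelAbstractions backend yields a device array instead, which converts to an isbits `CuDeviceVector` at launch. A hedged sketch of the change (the `TapeType` and `iterspace` names come from the surrounding extension code; the "before" line is my reconstruction, not the actual original):

```julia
# Before (hypothetical): a host Vector, which is not isbits when passed to the kernel
# subtape = Vector{TapeType}(undef, length(blocks(iterspace)))

# After: allocate through the backend, so the kernel receives a device array
subtape = allocate(CUDABackend(), TapeType, size(blocks(iterspace)))
```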
Now with [email protected] and the artifact I get:
╰─$ julia --project=. reverse_gpu.jl
kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
Got exception outside of a @test
AssertionError: value_type(lhs_v) == value_type(rhs_v)
Stacktrace:
[1] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.SelectInst, offset::LLVM.ConstantInt, hasload::Bool)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:262
[2] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.BitCastInst, offset::LLVM.ConstantInt, hasload::Bool)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:223
[3] nodecayed_phis!(mod::LLVM.Module)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:278
[4] optimize!
@ ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:1334 [inlined]
[5] nested_codegen!(mode::Enzyme.API.CDerivativeMode, mod::LLVM.Module, funcspec::Core.MethodInstance, world::UInt64)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:1416
[6] enzyme_custom_common_rev(forward::Bool, B::LLVM.IRBuilder, orig::LLVM.CallInst, gutils::Enzyme.Compiler.GradientUtils, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tape::Nothing)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:567
[7] enzyme_custom_augfwd
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:886 [inlined]
[8] (::Enzyme.Compiler.var"#212#213")(B::Ptr{LLVM.API.LLVMOpaqueBuilder}, OrigCI::Ptr{LLVM.API.LLVMOpaqueValue}, gutils::Ptr{Nothing}, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tapeR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}})
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/llvmrules.jl:1139
[9] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{Enzyme.API.CDIFFE_TYPE}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{Nothing}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{Bool}, augmented::Ptr{Nothing}, atomicAdd::Bool)
@ Enzyme.API ~/.julia/packages/Enzyme/Dd2LU/src/api.jl:141
[10] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::Tuple{Bool, Bool, Bool}, returnPrimal::Bool, jlrules::Vector{String}, expectedTapeType::Type, loweredArgs::Set{Int64}, boxedArgs::Set{Int64})
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:3124
[11] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4756
[12] codegen
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4339 [inlined]
[13] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool) (repeats 2 times)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5351
[14] cached_compilation
@ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5385 [inlined]
[15] (::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})(ctx::LLVM.Context)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5451
[16] JuliaContext(f::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
[17] #s1056#505
@ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5403 [inlined]
[18] var"#s1056#505"(FA::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, ReturnPrimal::Any, ShadowInit::Any, World::Any, ABI::Any, ::Any, ::Type, ::Type, ::Type, tt::Any, ::Type, ::Type, ::Type, ::Type, ::Type, ::Any)
@ Enzyme.Compiler ./none:0
[19] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
@ Core ./boot.jl:602
[20] autodiff
@ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:209 [inlined]
[21] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
@ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:238
[22] autodiff
@ ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:224 [inlined]
[23] macro expansion
@ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
[24] macro expansion
@ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
[25] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
@ Main ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
[26] top-level scope
@ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64
With the latest Enzyme and Enzyme.jl, I get the error below in the call to https://github.com/JuliaGPU/KernelAbstractions.jl/blob/3c38fc7f56f36611c467893bcfdefad1b53a80eb/ext/CUDAEnzymeExt.jl#L54 .
[32421] signal (11.1): Segmentation fault
in expression starting at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64
typekeyvalue_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1622 [inlined]
lookup_typevalue at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1059
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2157
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:868 [inlined]
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:863
absint at /home/michel/.julia/dev/Enzyme/src/absint.jl:116
abs_typeof at /home/michel/.julia/dev/Enzyme/src/absint.jl:213
unknown function (ip: 0x7f48e19f5043)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:500
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:208
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:178
check_ir at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:157 [inlined]
#codegen#468 at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4382
codegen at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4346 [inlined]
#48 at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:672
JuliaContext at /home/michel/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
tape_type at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:671 [inlined]
#augmented_primal#4 at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:57
augmented_primal at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:14 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:13 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0 [inlined]
diffejulia_square_caller_3884_inner_1wrap at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0
macro expansion at /home/michel/.julia/dev/Enzyme/src/compiler.jl:5306 [inlined]
enzyme_call at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4984 [inlined]
CombinedAdjointThunk at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4926 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:215 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:238
unknown function (ip: 0x7f48e19edfba)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:224 [inlined]
macro expansion at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
macro expansion at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
enzyme_testsuite at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
unknown function (ip: 0x7f49504d5c9f)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2070
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46343.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82703.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7f4967759d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 223384695 (Pool: 223135920; Big: 248775); GC: 129
[1] 32421 segmentation fault julia --project=. reverse_gpu.jl
You should update Enzyme to the latest release (0.11.14).
The reverse kernel uses autodiff_deferred_thunk, whereas forward mode uses autodiff_deferred. Indeed, there is no test for autodiff_deferred_thunk on CUDA in Enzyme.jl. I'm trying my luck, but I'm not sure I'll figure it out.
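For reference, a minimal sketch of the two entry points being contrasted, based on the Enzyme.jl 0.11 API (both calls only make sense inside deferred/GPU compilation, so they are shown as comments; the concrete annotation types are illustrative, not taken from this PR):

```julia
using Enzyme

# A toy kernel body to differentiate.
square!(x) = (x .= x .^ 2; nothing)

# Forward mode uses the one-shot deferred entry point:
#   autodiff_deferred(Forward, square!, Duplicated(x, dx))

# Reverse mode is split: build forward/reverse thunks first, then run them,
# threading the tape from the augmented forward pass into the reverse pass:
#   fwd, rev = autodiff_deferred_thunk(ReverseSplitWithPrimal,
#                                      Const{typeof(square!)}, Const,
#                                      Duplicated{typeof(x)})
#   tape, primal, shadow = fwd(Const(square!), Duplicated(x, dx))
#   rev(Const(square!), Duplicated(x, dx), tape)
```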
kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
Got exception outside of a @test
InvalidIRError: compiling MethodInstance for CUDAEnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, false)}, ::CuDeviceVector{Float64, 1}, ::Duplicated{CuDeviceVector{Float64, 1}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to autodiff_deferred_thunk(::EnzymeCore.ReverseModeSplit{ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI}, ::Type{FA}, ::Type{A}, args...) where {FA<:Annotation, A<:Annotation, ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI<:ABI} @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:726)
Stacktrace:
[1] aug_fwd
@ ~/.julia/dev/KernelAbstractions/ext/enzyme_utils.jl:7
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:147
@vchuravy Cleaned up. Are we waiting for https://github.com/EnzymeAD/Enzyme.jl/pull/1104 and https://github.com/JuliaGPU/CUDA.jl/pull/2260 ?
We will need to change https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c5fe83c899b3fd29308564467c3a3722179bfe9d/Project.toml#L23 to allow only 0.7.1.
@michel2323 given that the prerequisites have landed, mind getting this over the finish line?
@wsmoses @vchuravy Cleaned up, with working tests (assuming CUDA is working). The last unresolved issue is active arguments to a kernel: the compiler cannot figure out the types of the actives, so they are all marked Any, which then leads to a wrong return type.
https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c21f6bbf107a495c34fe746d5cca145869af7473/ext/EnzymeExt.jl#L334
I tried to fix it, but I'm not sure there's a way. For now, it errors gracefully with https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c21f6bbf107a495c34fe746d5cca145869af7473/ext/EnzymeExt.jl#L259 in the augmented forward run.
This will need a rebase for #478.
@vchuravy Bump. Is there a blocker here?
The tests are a bit sparse, and shouldn't they be enabled for more than the CPU backend?