
Add GPU reverse mode to EnzymeExt

Open · wsmoses opened this issue 1 year ago • 10 comments

This is @michel2323's PR, but I'm opening an issue so we have a place to discuss it.

wsmoses · Jan 24 '24

wmoses@beast:~/git/Enzyme.jl/KernelAbstractions.jl (enz_rev_gpu) $ ../julia-1.10.0-rc2/bin/julia --project reverse_gpu.jl 
Custom rule GPU
TapeType = @NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}
kernels: Error During Test at /home/wmoses/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:31
  Got exception outside of a @test
  GPU compilation of MethodInstance for EnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, true)}, ::Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, ::Duplicated{CuDeviceVector{Float64, 1}}) failed
  KernelError: passing and using non-bitstype argument
  
  Argument 5 to your kernel function is of type Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, which is not isbits:
  
  
  Stacktrace:
    [1] check_invocation(job::GPUCompiler.CompilerJob)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:92
    [2] macro expansion
      @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:123 [inlined]
    [3] macro expansion
      @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
    [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:121
    [5] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
    [6] compile
      @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
    [7] #1075
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:247 [inlined]
    [8] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
    [9] compile(job::GPUCompiler.CompilerJob)
      @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:246
   [10] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
   [11] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
   [12] macro expansion
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:382 [inlined]
   [13] macro expansion
      @ ./lock.jl:267 [inlined]
   [14] cufunction(f::typeof(EnzymeExt.aug_fwd), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, typeof(gpu_square!), Val{(false, false, true)}, Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, Duplicated{CuDeviceVector{Float64, 1}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
      @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:377
   [15] macro expansion
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:104 [inlined]
   [16] (::KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(EnzymeExt.aug_fwd)})(::Function, ::Vararg{Any}; ndrange::Tuple{Int64}, workgroupsize::Nothing)
      @ CUDA.CUDAKernels ~/.julia/packages/CUDA/YIj5X/src/CUDAKernels.jl:118
   [17] #augmented_primal#12
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:163
   [18] augmented_primal
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:115 [inlined]
   [19] square_caller
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:14 [inlined]
   [20] square_caller
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0 [inlined]
   [21] diffejulia_square_caller_3883_inner_1wrap
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0
   [22] macro expansion
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5306 [inlined]
   [23] enzyme_call
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4984 [inlined]
   [24] CombinedAdjointThunk
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4926 [inlined]
   [25] autodiff
      @ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:215 [inlined]
   [26] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
      @ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:238
   [27] autodiff
      @ ~/git/Enzyme.jl/src/Enzyme.jl:224 [inlined]
   [28] macro expansion
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:40 [inlined]
   [29] macro expansion
      @ ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [30] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
      @ Main ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:32
   [31] top-level scope
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:64
   [32] include(mod::Module, _path::String)
      @ Base ./Base.jl:495
   [33] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:318
   [34] _start()
      @ Base ./client.jl:552
Test Summary: | Error  Total     Time
kernels       |     1      1  1m38.4s
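
For context on the failure above: GPU kernel arguments must be isbits, and the tape here is materialized as a host Vector of tape NamedTuples, which is not. A minimal illustrative check (not from the original session):

julia> isbitstype(Vector{Float64})        # host arrays hold GC-managed pointers
false

julia> isbitstype(Core.LLVMPtr{UInt8, 0}) # raw device pointers are isbits
true

The fix discussed below is to allocate the tape storage on the device instead.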

wsmoses · Jan 24 '24

@wsmoses Forward mode, which worked when I started on this, doesn't work anymore. I'm on the latest Enzyme#main; see out_fwd.log.

(KernelAbstractions) pkg> st
Project KernelAbstractions v0.9.15
Status `~/.julia/dev/KernelAbstractions/Project.toml`
  [79e6a3ab] Adapt v4.0.1
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v5.2.0
  [7da242da] Enzyme v0.11.13 `~/.julia/dev/Enzyme`
  [1914dd2f] MacroTools v0.5.13
  [aea7be01] PrecompileTools v1.2.0
  [ae029012] Requires v1.3.0
  [90137ffa] StaticArrays v1.9.1
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.3
  [7cc45869] Enzyme_jll v0.0.98+0 `../Enzyme_jll`
  [b77e0a4c] InteractiveUtils
  [37e2e46d] LinearAlgebra
  [2f01184e] SparseArrays v1.10.0
  [cf7118a7] UUIDs

michel2323 · Jan 25 '24

I added the following allocate call:

subtape = allocate(CUDABackend(), TapeType, size(blocks(iterspace)))
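
This should resolve the non-bitstype kernel-argument error above, since allocate returns a backend-native array (a CuArray for CUDABackend) whose device-side view is an isbits CuDeviceArray, unlike a host Vector. A minimal sketch of the pattern, with the element type and length assumed:

using KernelAbstractions, CUDA

TapeType = UInt8                             # placeholder; the real tape type comes from Enzyme
backend  = CUDABackend()
subtape  = allocate(backend, TapeType, 128)  # one tape slot per workgroup; count assumed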

Now with Enzyme@0.11.13 and the artifact I get:

╰─$ julia --project=. reverse_gpu.jl
kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
  Got exception outside of a @test
  AssertionError: value_type(lhs_v) == value_type(rhs_v)
  Stacktrace:
    [1] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.SelectInst, offset::LLVM.ConstantInt, hasload::Bool)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:262
    [2] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.BitCastInst, offset::LLVM.ConstantInt, hasload::Bool)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:223
    [3] nodecayed_phis!(mod::LLVM.Module)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:278
    [4] optimize!
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:1334 [inlined]
    [5] nested_codegen!(mode::Enzyme.API.CDerivativeMode, mod::LLVM.Module, funcspec::Core.MethodInstance, world::UInt64)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:1416
    [6] enzyme_custom_common_rev(forward::Bool, B::LLVM.IRBuilder, orig::LLVM.CallInst, gutils::Enzyme.Compiler.GradientUtils, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tape::Nothing)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:567
    [7] enzyme_custom_augfwd
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:886 [inlined]
    [8] (::Enzyme.Compiler.var"#212#213")(B::Ptr{LLVM.API.LLVMOpaqueBuilder}, OrigCI::Ptr{LLVM.API.LLVMOpaqueValue}, gutils::Ptr{Nothing}, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tapeR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}})
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/llvmrules.jl:1139
    [9] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{Enzyme.API.CDIFFE_TYPE}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{Nothing}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{Bool}, augmented::Ptr{Nothing}, atomicAdd::Bool)
      @ Enzyme.API ~/.julia/packages/Enzyme/Dd2LU/src/api.jl:141
   [10] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::Tuple{Bool, Bool, Bool}, returnPrimal::Bool, jlrules::Vector{String}, expectedTapeType::Type, loweredArgs::Set{Int64}, boxedArgs::Set{Int64})
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:3124
   [11] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4756
   [12] codegen
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4339 [inlined]
   [13] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool) (repeats 2 times)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5351
   [14] cached_compilation
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5385 [inlined]
   [15] (::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})(ctx::LLVM.Context)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5451
   [16] JuliaContext(f::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
   [17] #s1056#505
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5403 [inlined]
   [18] var"#s1056#505"(FA::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, ReturnPrimal::Any, ShadowInit::Any, World::Any, ABI::Any, ::Any, ::Type, ::Type, ::Type, tt::Any, ::Type, ::Type, ::Type, ::Type, ::Type, ::Any)
      @ Enzyme.Compiler ./none:0
   [19] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
      @ Core ./boot.jl:602
   [20] autodiff
      @ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:209 [inlined]
   [21] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
      @ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:238
   [22] autodiff
      @ ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:224 [inlined]
   [23] macro expansion
      @ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
   [24] macro expansion
      @ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [25] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
      @ Main ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
   [26] top-level scope
      @ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64

With the latest Enzyme and Enzyme.jl, I get the error below in the call to https://github.com/JuliaGPU/KernelAbstractions.jl/blob/3c38fc7f56f36611c467893bcfdefad1b53a80eb/ext/CUDAEnzymeExt.jl#L54.

[32421] signal (11.1): Segmentation fault
in expression starting at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64
typekeyvalue_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1622 [inlined]
lookup_typevalue at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1059
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2157
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:868 [inlined]
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:863
absint at /home/michel/.julia/dev/Enzyme/src/absint.jl:116
abs_typeof at /home/michel/.julia/dev/Enzyme/src/absint.jl:213
unknown function (ip: 0x7f48e19f5043)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:500
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:208
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:178
check_ir at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:157 [inlined]
#codegen#468 at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4382
codegen at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4346 [inlined]
#48 at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:672
JuliaContext at /home/michel/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
tape_type at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:671 [inlined]
#augmented_primal#4 at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:57
augmented_primal at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:14 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:13 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0 [inlined]
diffejulia_square_caller_3884_inner_1wrap at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0
macro expansion at /home/michel/.julia/dev/Enzyme/src/compiler.jl:5306 [inlined]
enzyme_call at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4984 [inlined]
CombinedAdjointThunk at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4926 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:215 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:238
unknown function (ip: 0x7f48e19edfba)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:224 [inlined]
macro expansion at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
macro expansion at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
enzyme_testsuite at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
unknown function (ip: 0x7f49504d5c9f)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2070
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46343.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82703.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7f4967759d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 223384695 (Pool: 223135920; Big: 248775); GC: 129
[1]    32421 segmentation fault  julia --project=. reverse_gpu.jl

michel2323 · Jan 29 '24

You should update Enzyme to the latest release (0.11.14).

wsmoses · Jan 29 '24

The reverse kernel uses autodiff_deferred_thunk, whereas forward mode uses autodiff_deferred. Indeed, there is no test for autodiff_deferred_thunk on CUDA in Enzyme.jl. I'm trying my luck, but I'm not sure I'll figure it out.
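
For reference, a rough sketch of the two entry points on a toy CPU function, mirroring the thunk signature shown in the error below; the exact calling convention is an assumption here and has shifted across Enzyme versions:

using Enzyme

g!(x) = (x[1] *= x[1]; nothing)
x, dx = [3.0], [1.0]

# Forward mode: a single deferred call, compiled lazily.
autodiff_deferred(Forward, Const(g!), Const, Duplicated(x, dx))

# Split reverse mode: request augmented-forward and reverse thunks up front,
# run the forward pass to produce a tape, then consume the tape in reverse.
fwd_thunk, rev_thunk = autodiff_deferred_thunk(
    ReverseSplitWithPrimal, Const{typeof(g!)}, Const, Duplicated{Vector{Float64}})
tape, _, _ = fwd_thunk(Const(g!), Duplicated(x, dx))
rev_thunk(Const(g!), Duplicated(x, dx), tape)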

 kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
  Got exception outside of a @test
  InvalidIRError: compiling MethodInstance for CUDAEnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, false)}, ::CuDeviceVector{Float64, 1}, ::Duplicated{CuDeviceVector{Float64, 1}}) resulted in invalid LLVM IR
  Reason: unsupported dynamic function invocation (call to autodiff_deferred_thunk(::EnzymeCore.ReverseModeSplit{ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI}, ::Type{FA}, ::Type{A}, args...) where {FA<:Annotation, A<:Annotation, ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI<:ABI} @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:726)
  Stacktrace:
   [1] aug_fwd
     @ ~/.julia/dev/KernelAbstractions/ext/enzyme_utils.jl:7
  Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
  Stacktrace:
    [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:147

michel2323 · Feb 06 '24

@vchuravy Cleaned up. Are we waiting for https://github.com/EnzymeAD/Enzyme.jl/pull/1104 and https://github.com/JuliaGPU/CUDA.jl/pull/2260?

michel2323 · Feb 27 '24

Will need to change https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c5fe83c899b3fd29308564467c3a3722179bfe9d/Project.toml#L23 so the compat entry allows only 0.7.1.
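
Assuming that line is the EnzymeCore compat entry, the change would look roughly like:

[compat]
EnzymeCore = "0.7.1"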

vchuravy · Apr 09 '24

@michel2323 Given that the prerequisites have landed, do you mind getting this over the finish line?

wsmoses · May 11 '24

@wsmoses @vchuravy Cleaned up, with working tests (when CUDA is working). The last unresolved issue is active arguments to a kernel: the compiler cannot figure out the type of the actives here, so all actives are marked Any, which then leads to a wrong return type.

https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c21f6bbf107a495c34fe746d5cca145869af7473/ext/EnzymeExt.jl#L334

I tried to fix it, but I'm not sure there's a way. So for now, it errors gracefully at https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c21f6bbf107a495c34fe746d5cca145869af7473/ext/EnzymeExt.jl#L259 during the augmented forward run.
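
To illustrate what an active kernel argument means here, a hypothetical kernel (not from the PR) where the scalar a would be the active argument:

using KernelAbstractions

@kernel function scale!(x, a)
    I = @index(Global)
    x[I] *= a
end

# Differentiating w.r.t. the array via Duplicated(x, dx) works; passing `a`
# as Active is the case whose type cannot be inferred and falls back to Any.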

michel2323 · May 31 '24

Will need a rebase for #478.

vchuravy · Jun 06 '24

@vchuravy Bump. Is there a blocker here?

michel2323 · Jul 03 '24

The tests are a bit sparse, and shouldn't they be enabled for more than just the CPU backend?

vchuravy · Jul 03 '24