
Error on function with views and function which takes arrays of arrays

Open DomCRose opened this issue 3 years ago • 9 comments

Hi, thanks for all your work on this package, and apologies in advance: it's entirely likely I've messed up the autodiff call or am unaware of something that isn't supported yet.

First, here are two view-based examples which error. In an earlier version, p was a 4D array also indexed by the loop index, which produced a shorter error (with the difference in trace causing a different error between the two versions in that case). However, these current versions produce such a long error that the REPL/language server crashes.

using Enzyme, LinearAlgebra

function contraction1(x, p, cache1, cache2)
    cache1 .= @view(p[x[1]+1, :, :])
    for i = 2:length(x)
        mul!(cache2, @view(p[x[i]+1, :, :]), cache1)
        cache1 .= cache2
    end
    trace = 0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p = rand(d, χ, χ)
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction1, Const(x), Active(p),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

and

using Enzyme, LinearAlgebra
function contraction2(x, p, cache1, cache2)
    cache1 .= @view(p[x[1]+1, :, :])
    for i = 2:length(x)
        mul!(cache2, @view(p[x[i]+1, :, :]), cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p = rand(d, χ, χ)
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction2, Const(x), Active(p),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

My original goal was to actually use a vector/matrix of matrices for p and the cache (which will be much more complicated in the case I'm aiming for). These versions (with/without the call to tr) are below, and produce different errors.

using Enzyme, LinearAlgebra

function contraction3(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    trace = 0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p = [rand(χ, χ) for i=1:d]
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction3, Const(x), Active(p),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

Error:

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
ERROR: AssertionError: length(args) == length((collect(parameters(entry_f)))[1 + sret + returnRoots:end])
Stacktrace:
  [1] lower_convention(functy::Type, mod::LLVM.Module, entry_f::LLVM.Function, actualRetType::Type)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3698
  [2] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction3), Tuple{Vector{Int64}, Vector{Matrix{Float64}}, Matrix{Float64}, Matrix{Float64}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4123
  [3] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction3), Tuple{Vector{Int64}, Vector{Matrix{Float64}}, Matrix{Float64}, Matrix{Float64}}}})
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4599
  [4] cached_compilation(job::GPUCompiler.CompilerJob, key::UInt64, specid::UInt64)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4637
  [5] #s565#115
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4697 [inlined]
  [6] var"#s565#115"(F::Any, Fn::Any, DF::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, specid::Any, ReturnPrimal::Any, ::Any, #unused#::Type, f::Any, df::Any, #unused#::Type, tt::Any, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Any)    
    @ Enzyme.Compiler .\none:0
  [7] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core .\boot.jl:580
  [8] thunk
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4725 [inlined]
  [9] thunk (repeats 2 times)
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4718 [inlined]
 [10] autodiff(::Enzyme.ReverseMode, ::typeof(contraction3), ::Type{Const{Union{Float64, Int64}}}, ::Const{Vector{Int64}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:285
 [11] autodiff(::Enzyme.ReverseMode, ::typeof(contraction3), ::Const{Vector{Int64}}, ::Active{Vector{Matrix{Float64}}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:319
 [12] top-level scope
    @ c:\Users\domin\Dropbox (Personal)\side_projects\AdaptiveTrajectorySampling\src\approximations\tensor_approx.jl:77
using Enzyme, LinearAlgebra

function contraction4(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p = [rand(χ, χ) for i=1:d]
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction4, Const(x), Active(p),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

Error:

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
ERROR: Conversion of boxed type Vector{Matrix{Float64}} is not allowed
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] convert(::Type{LLVM.LLVMType}, typ::Type; ctx::LLVM.Context, allow_boxed::Bool)
    @ LLVM.Interop C:\Users\domin\.julia\packages\LLVM\WjSQG\src\interop\base.jl:92
  [3] create_abi_wrapper(enzymefn::LLVM.Function, F::Type, argtypes::Vector{DataType}, rettype::Type, actualRetType::Type, Mode::Enzyme.API.CDerivativeMode, augmented::Nothing, dupClosure::Bool, width::Int64, returnPrimal::Bool)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3345
  [4] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction4), Tuple{Vector{Int64}, Vector{Matrix{Float64}}, Matrix{Float64}, Matrix{Float64}}}}, mod::LLVM.Module, primalf::LLVM.Function, adjoint::GPUCompiler.FunctionSpec{typeof(contraction4), Tuple{Const{Vector{Int64}}, Active{Vector{Matrix{Float64}}}, Duplicated{Matrix{Float64}}, Duplicated{Matrix{Float64}}}}, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, dupClosure::Bool, wrap::Bool, modifiedBetween::Bool, returnPrimal::Bool)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3278
  [5] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction4), Tuple{Vector{Int64}, Vector{Matrix{Float64}}, Matrix{Float64}, Matrix{Float64}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4158
  [6] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction4), Tuple{Vector{Int64}, Vector{Matrix{Float64}}, Matrix{Float64}, Matrix{Float64}}}})
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4599
  [7] cached_compilation(job::GPUCompiler.CompilerJob, key::UInt64, specid::UInt64)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4637
  [8] #s565#115
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4697 [inlined]
  [9] var"#s565#115"(F::Any, Fn::Any, DF::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, specid::Any, ReturnPrimal::Any, ::Any, #unused#::Type, f::Any, df::Any, #unused#::Type, tt::Any, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Any)
    @ Enzyme.Compiler .\none:0
 [10] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core .\boot.jl:580
 [11] thunk
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4725 [inlined]
 [12] thunk (repeats 2 times)
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4718 [inlined]
 [13] autodiff(::Enzyme.ReverseMode, ::typeof(contraction4), ::Type{Active{Float64}}, ::Const{Vector{Int64}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:285
 [14] autodiff(::Enzyme.ReverseMode, ::typeof(contraction4), ::Const{Vector{Int64}}, ::Active{Vector{Matrix{Float64}}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:319
 [15] top-level scope
    @ c:\Users\domin\Dropbox (Personal)\side_projects\AdaptiveTrajectorySampling\src\approximations\tensor_approx.jl:82

Finally, a version which removes both views and arrays of arrays, but still errors on a boxed matrix.

using Enzyme, LinearAlgebra

function contraction5(x, p1, p2, cache1, cache2)
    if x[1] == 0
        cache1 .= p1
    else
        cache1 .= p2
    end
    for i = 2:length(x)
        if x[i] == 0
            mul!(cache2, p1, cache1)
        else
            mul!(cache2, p2, cache1)
        end
        cache1 .= cache2
    end
    return tr(cache1)
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p1 = rand(χ, χ)
p2 = rand(χ, χ)
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction5, Const(x), Active(p1), Active(p2),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

Error

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
ERROR: Conversion of boxed type Matrix{Float64} is not allowed
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] convert(::Type{LLVM.LLVMType}, typ::Type; ctx::LLVM.Context, allow_boxed::Bool)
    @ LLVM.Interop C:\Users\domin\.julia\packages\LLVM\WjSQG\src\interop\base.jl:92
  [3] create_abi_wrapper(enzymefn::LLVM.Function, F::Type, argtypes::Vector{DataType}, rettype::Type, actualRetType::Type, Mode::Enzyme.API.CDerivativeMode, augmented::Nothing, dupClosure::Bool, width::Int64, returnPrimal::Bool)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3345
  [4] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction5), Tuple{Vector{Int64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}}}, mod::LLVM.Module, primalf::LLVM.Function, adjoint::GPUCompiler.FunctionSpec{typeof(contraction5), Tuple{Const{Vector{Int64}}, Active{Matrix{Float64}}, Active{Matrix{Float64}}, Duplicated{Matrix{Float64}}, Duplicated{Matrix{Float64}}}}, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, dupClosure::Bool, wrap::Bool, modifiedBetween::Bool, returnPrimal::Bool)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3278
  [5] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction5), Tuple{Vector{Int64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4158
  [6] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction5), Tuple{Vector{Int64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}}})
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4599
  [7] cached_compilation(job::GPUCompiler.CompilerJob, key::UInt64, specid::UInt64)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4637
  [8] #s565#115
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4697 [inlined]
  [9] var"#s565#115"(F::Any, Fn::Any, DF::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, specid::Any, ReturnPrimal::Any, ::Any, #unused#::Type, f::Any, df::Any, #unused#::Type, tt::Any, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Any)    
    @ Enzyme.Compiler .\none:0
 [10] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core .\boot.jl:580
 [11] thunk
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4725 [inlined]
 [12] thunk (repeats 2 times)
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4718 [inlined]
 [13] autodiff(::Enzyme.ReverseMode, ::typeof(contraction5), ::Type{Active{Float64}}, ::Const{Vector{Int64}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:285
 [14] autodiff(::Enzyme.ReverseMode, ::typeof(contraction5), ::Const{Vector{Int64}}, ::Active{Matrix{Float64}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:319
 [15] top-level scope
    @ c:\Users\domin\Dropbox (Personal)\side_projects\AdaptiveTrajectorySampling\src\approximations\tensor_approx.jl:109

Version info:

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

On Enzyme v0.10.4.

Apologies for the long post! Hope it helps.

DomCRose avatar Sep 02 '22 15:09 DomCRose

The problem here isn't the view of the array: since p is an array, it needs to be Duplicated, not Active. Try that and see what happens?
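A minimal sketch of this suggestion, using a hypothetical toy function f (not from the issue): mutable arrays take Duplicated(value, shadow) instead of Active, and the shadow should be zero-initialized since reverse mode accumulates into it.

```julia
using Enzyme

# Hypothetical stand-in for the contraction: writes into a cache
# and returns a scalar.
f(p, cache) = (cache .= p .* p; sum(cache))

p = rand(3)
dp = zero(p)          # zero-initialized shadow for the gradient w.r.t. p
cache = zeros(3)
dcache = zero(cache)  # shadow for the mutated cache

autodiff(Reverse, f, Active, Duplicated(p, dp), Duplicated(cache, dcache))
dp ≈ 2 .* p           # gradient of sum(p.^2)
```

After the call, dp holds the gradient; the returned scalar's activity is given by the Active annotation.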

wsmoses avatar Sep 02 '22 17:09 wsmoses

I had an example failing with a similar error to the first MWE here, but that appears to be fixed on main with the latest JLL. Unfortunately, changing x and p to Duplicated in the MWE here does not appear to help. I also added a return type activity annotation for good measure, but no luck:

g = autodiff(
    Reverse, contraction1, Active, Duplicated(x, similar(x)), Duplicated(p, similar(p)),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

ToucheSir avatar Sep 08 '22 03:09 ToucheSir

What is the error you are seeing? If it is the following, this is a distinct (type-union) issue:

Illegal updateAnalysis prev:{[-1]:Integer} new: {[-1]:Float@double}
val:   %229 = bitcast i64 %.sroa.095.0267 to double, !dbg !403 origin=  %.pn = select i1 %value_phi26.i269, double %230, double %229, !dbg !403

Caused by:
Stacktrace:
 [1] contraction1
   @ ./REPL[4]:9
 [2] contraction1
   @ ./REPL[4]:0

Stacktrace:
  [1] julia_error(cstr::Cstring, val::Ptr{LLVM.API.LLVMOpaqueValue}, errtype::Enzyme.API.ErrorType, data::Ptr{Nothing})
    @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:3061
  [2] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{Enzyme.API.CDIFFE_TYPE}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{Nothing}, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{Bool}, augmented::Ptr{Nothing}, atomicAdd::Bool)
    @ Enzyme.API ~/git/Enzyme.jl/src/api.jl:118
  [3] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Vector{Int64}, Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}}}}, mod::LLVM.Module, primalf::LLVM.Function, adjoint::GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Duplicated{Vector{Int64}}, Duplicated{Array{Float64, 3}}, Duplicated{Matrix{Float64}}, Duplicated{Matrix{Float64}}}}, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, dupClosure::Bool, wrap::Bool, modifiedBetween::Bool, returnPrimal::Bool, jlrules::Vector{String})
    @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:3875
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Vector{Int64}, Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4850
  [5] _thunk
    @ ~/git/Enzyme.jl/src/compiler.jl:5278 [inlined]
  [6] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Vector{Int64}, Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}}}})
    @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5272
  [7] cached_compilation(job::GPUCompiler.CompilerJob, key::UInt64, specid::UInt64)
    @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5316
  [8] #s741#132
    @ ~/git/Enzyme.jl/src/compiler.jl:5376 [inlined]
  [9] var"#s741#132"(F::Any, Fn::Any, DF::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, specid::Any, ReturnPrimal::Any, ::Any, #unused#::Type, f::Any, df::Any, #unused#::Type, tt::Any, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Any)
    @ Enzyme.Compiler ./none:0
 [10] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:582
 [11] thunk
    @ ~/git/Enzyme.jl/src/compiler.jl:5404 [inlined]
 [12] thunk (repeats 2 times)
    @ ~/git/Enzyme.jl/src/compiler.jl:5397 [inlined]
 [13] autodiff(::Enzyme.ReverseMode, ::typeof(contraction1), ::Type{Active}, ::Duplicated{Vector{Int64}}, ::Vararg{Any})
    @ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:296
 [14] top-level scope
    @ REPL[12]:1

This error is caused by the fact that trace is a union of Float64/Int64. It should be fixable by changing trace = 0 to trace = 0.0:

using Enzyme, LinearAlgebra

function contraction1(x, p, cache1, cache2)
    cache1 .= @view(p[x[1]+1, :, :])
    for i = 2:length(x)
        mul!(cache2, @view(p[x[i]+1, :, :]), cache1)
        cache1 .= cache2
    end
    trace = 0.0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

L = 4
χ = 3
d = 2
x = rand(0:(d-1), L)
p = rand(d, χ, χ)
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

g = autodiff(
    Reverse, contraction1, Active, Duplicated(x, similar(x)), Duplicated(p, similar(p)),
    Duplicated(cache1, similar(cache1)),
    Duplicated(cache2, similar(cache2))
)

On latest main this does succeed.
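As a hedged aside, the Int/Float union can be seen directly with Julia's inference (trace_union and trace_stable are illustrative names, not from the issue):

```julia
# Accumulator starts as an Int, so the inferred return type is a union
# (an empty matrix would return the Int 0, a non-empty one a Float64).
function trace_union(A)
    t = 0
    for i in 1:size(A, 1)
        t += A[i, i]
    end
    return t
end

# Accumulator starts as a Float64, so inference stays concrete.
function trace_stable(A)
    t = 0.0
    for i in 1:size(A, 1)
        t += A[i, i]
    end
    return t
end

A = [1.0 2.0; 3.0 4.0]
trace_union(A) == trace_stable(A) == 5.0                   # same value either way
only(Base.return_types(trace_union, (Matrix{Float64},)))   # Union{Float64, Int64}
only(Base.return_types(trace_stable, (Matrix{Float64},)))  # Float64
```
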

wsmoses avatar Sep 08 '22 20:09 wsmoses

Apologies for the slow response, and missing the need for duplicated! Good spot on the union also.

Bearing in mind I'm on Julia 1.7 and the current release of Enzyme rather than main, I'll post what I'm currently seeing below anyway, and will have a go with more up-to-date versions later.

So, changing the autodiff calls to have Duplicated p's and caches, I'm seeing:

contraction1

Leaving trace as an Int to start with, the error I get is

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
ERROR: AssertionError: length(args) == length((collect(parameters(entry_f)))[1 + sret + returnRoots:end])
Stacktrace:
  [1] lower_convention(functy::Type, mod::LLVM.Module, entry_f::LLVM.Function, actualRetType::Type)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:3698
  [2] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Vector{Int64}, Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}}}}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, ctx::LLVM.Context, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4123
  [3] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams, GPUCompiler.FunctionSpec{typeof(contraction1), Tuple{Vector{Int64}, Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}}}})
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4599
  [4] cached_compilation(job::GPUCompiler.CompilerJob, key::UInt64, specid::UInt64)
    @ Enzyme.Compiler C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4637
  [5] #s565#115
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4697 [inlined]
  [6] var"#s565#115"(F::Any, Fn::Any, DF::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, specid::Any, ReturnPrimal::Any, ::Any, #unused#::Type, f::Any, df::Any, #unused#::Type, tt::Any, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Type, #unused#::Any)
    @ Enzyme.Compiler .\none:0
  [7] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core .\boot.jl:580
  [8] thunk
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4725 [inlined]
  [9] thunk (repeats 2 times)
    @ C:\Users\domin\.julia\packages\Enzyme\di3zM\src\compiler.jl:4718 [inlined]
 [10] autodiff(::Enzyme.ReverseMode, ::typeof(contraction1), ::Type{Active}, ::Const{Vector{Int64}}, ::Vararg{Any})
    @ Enzyme C:\Users\domin\.julia\packages\Enzyme\di3zM\src\Enzyme.jl:285
 [11] top-level scope
    @ c:\Users\domin\Dropbox (Personal)\side_projects\AdaptiveTrajectorySampling\src\approximations\tensor_approx.jl:61

If I remove the union by initializing trace as 0.0, it crashes the REPL and I'm not sure how to recover anything from that.

I see the same error in the union case if I also duplicate x, and the REPL also crashes in the no-union case if I duplicate x.

contraction2

This also crashes the REPL, I suspect for the same reasons.

contraction3

Works for small matrices with trace=0.0. With the union I see the same error as for the union case of contraction1.

contraction4

Works for small matrices.

contraction5

Works for small matrices.

Increasing / varying chi

However, I have noticed that if I increase chi (from e.g. 3 to 16), the results appear incorrect (or, at least, they disagree with Zygote for a non-mutating function that I believe does the same thing). It's also quite slow compared to Zygote for larger matrices, but I'm not sure that is a primary concern right now, and I'd guess it could be related to the BLAS fallback warnings.

For context, the function I'm using with Zygote is

function apply(x, p)
    return tr(mapreduce(xs -> p[xs+1], *, reverse(x)))
end
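As a hedged sanity check (not from the issue): the mutating contraction and this mapreduce formulation agree on the primal value, so any gradient disagreement is down to the AD rather than the functions themselves.

```julia
using LinearAlgebra

# Mutating contraction (array-of-matrices form, as in contraction4 above).
function contraction(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

# Non-mutating version used with Zygote; matrix products associate, so the
# left-fold over reverse(x) builds the same product p[x[L]] * ... * p[x[1]].
apply(x, p) = tr(mapreduce(xs -> p[xs+1], *, reverse(x)))

χ, d, L = 3, 2, 4
p = [rand(χ, χ) for _ in 1:d]
x = rand(0:(d-1), L)
contraction(x, p, zeros(χ, χ), zeros(χ, χ)) ≈ apply(x, p)  # true
```
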

Finally, I've noticed that varying chi within a session seems to break the autodiff, with it sometimes returning NaNs or completely random and extremely large numbers.

I'll give this a go on 1.8 and main later.

DomCRose avatar Sep 09 '22 10:09 DomCRose

Can confirm on Julia 1.8 and latest main with the new JLL that none of the versions error when trace isn't a union. They also appear to produce correct results (assuming Zygote is correct) at larger chi. However, I'm still seeing somewhat inconsistent behaviour, with the derivatives sometimes being full of NaNs or numbers on the order of 10^300. Every version of the function produces the warnings

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

warning: Linking two modules of different target triples: 'bcloader' is 'x86_64-w64-windows-gnu' whereas 'text' is 'x86_64-w64-mingw32'

┌ Warning: Using fallback BLAS replacements, performance may be degraded
└ @ Enzyme.Compiler C:\Users\domin\.julia\packages\GPUCompiler\jVY4I\src\utils.jl:35
warning: didn't implement memmove, using memcpy as fallback which can result in errors

This last one in particular: could it be causing the random incorrect results?

DomCRose avatar Sep 09 '22 11:09 DomCRose

Can you post the specific version of the code that creates incorrect result?

wsmoses avatar Sep 10 '22 19:09 wsmoses

The code I'm running is:

using Enzyme
using LinearAlgebra

function contraction1(x, p, cache1, cache2)
    cache1 .= @view(p[x[1]+1, :, :])
    for i = 2:length(x)
        mul!(cache2, @view(p[x[i]+1, :, :]), cache1)
        cache1 .= cache2
    end
    trace = 0.0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

function contraction2(x, p, cache1, cache2)
    cache1 .= @view(p[x[1]+1, :, :])
    for i = 2:length(x)
        mul!(cache2, @view(p[x[i]+1, :, :]), cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

function contraction3(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    trace = 0.0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

function contraction4(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

function contraction5(x, p1, p2, cache1, cache2)
    if x[1] == 0
        cache1 .= p1
    else
        cache1 .= p2
    end
    for i = 2:length(x)
        if x[i] == 0
            mul!(cache2, p1, cache1)
        else
            mul!(cache2, p2, cache1)
        end
        cache1 .= cache2
    end
    return tr(cache1)
end

L = 4
χ = 16
d = 2
x = rand(0:(d-1), L)
p = randn(d, χ, χ) .* 0.25
p2 = [p[i, :, :] for i = 1:d]
p31 = p2[1]
p32 = p2[2]
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

begin
    dfdp1 = zero(p)
    g = autodiff(
        Reverse, contraction1, Const(x), 
        Duplicated(p, dfdp1),
        Duplicated(cache1, similar(cache1)),
        Duplicated(cache2, similar(cache2))
    )
    display(dfdp1)
end
begin
    dfdp1 = zero(p)
    g = autodiff(
        Reverse, contraction2, Const(x), 
        Duplicated(p, dfdp1),
        Duplicated(cache1, similar(cache1)),
        Duplicated(cache2, similar(cache2))
    )
    display(dfdp1)
end
begin
    dfdp2 = zero.(p2)
    g = autodiff(
        Reverse, contraction3, Const(x), 
        Duplicated(p2, dfdp2),
        Duplicated(cache1, similar(cache1)),
        Duplicated(cache2, similar(cache2))
    )
    display(dfdp2)
end
begin
    dfdp2 = zero.(p2)
    g = autodiff(
        Reverse, contraction4, Const(x), 
        Duplicated(p2, dfdp2),
        Duplicated(cache1, similar(cache1)),
        Duplicated(cache2, similar(cache2))
    )
    display(dfdp2)
end
begin
    dfdp31 = zero(p31)
    dfdp32 = zero(p32)
    g = autodiff(
        Reverse, contraction5, Const(x),
        Duplicated(p31, dfdp31),
        Duplicated(p32, dfdp32),
        Duplicated(cache1, similar(cache1)),
        Duplicated(cache2, similar(cache2))
    )
    display(dfdp31)
    display(dfdp32)
end

As I rerun these begin-end blocks repeatedly in the REPL (in the VS Code integrated terminal), I initially see correct results, but I then regularly see arrays full of NaNs, or numbers up at the floating-point limit. After running the blocks enough times the results even seemed to become random but finite. Adding Active for the return or duplicating x doesn't seem to help.

Version info for the PC I just ran this on:

Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, rocketlake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  JULIA_PKG_SERVER = .

DomCRose avatar Sep 12 '22 13:09 DomCRose

So it just occurred to me that this could originate from using similar instead of zero for the cache duplication, since similar returns uninitialized memory that can contain NaNs, which would then propagate. Indeed, replacing all the similars with zero makes the results deterministic and correct!
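For anyone hitting the same thing, a minimal illustration of the difference (not from the original report): similar allocates uninitialized memory, so a shadow built with it can start out containing arbitrary bytes, while zero guarantees a clean starting point.

```julia
cache = zeros(3, 3)

shadow_bad = similar(cache)  # uninitialized: contents are whatever bytes were in memory
shadow_ok  = zero(cache)     # guaranteed all-zero, safe as a reverse-mode shadow

all(iszero, shadow_ok)  # true; no such guarantee holds for shadow_bad
```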

DomCRose avatar Sep 12 '22 15:09 DomCRose

Yes, for reverse mode the derivative is +='d into the shadow, so you need to zero-initialize it.
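To illustrate the accumulation semantics with a small sketch (not from the original thread; it assumes a simple scalar-returning function):

```julia
using Enzyme

f(x) = sum(abs2, x)  # gradient of f is 2x

x  = [1.0, 2.0]
dx = zeros(2)  # shadow must start at zero

autodiff(Reverse, f, Duplicated(x, dx))
# dx is now [2.0, 4.0]

autodiff(Reverse, f, Duplicated(x, dx))
# the second call accumulated (+=) into the same shadow: dx is [4.0, 8.0]
```

So reusing a shadow across calls without re-zeroing it doubles up the gradient, and starting from uninitialized memory pollutes it from the first call.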

Closing for now since it seems like this is resolved, reopen if not.

wsmoses avatar Sep 12 '22 17:09 wsmoses

Since this is still open, I'll note that somewhere around 0.10.6 the variants with views broke again; however, they all appear to be working on 0.10.11.

The results are quite slow with larger matrices compared with e.g. Zygote (about a factor of 8 for 16x16 matrices, and a factor of 100 for 64x64 matrices), but I'm guessing that will be fixed when the BLAS support is improved, so I think it's safe to close this.

DomCRose avatar Oct 11 '22 12:10 DomCRose

Can you post the benchmark code?

vchuravy avatar Oct 11 '22 14:10 vchuravy

Sure, for Enzyme I'm doing:

using BenchmarkTools
using Enzyme
using LinearAlgebra

function contraction1(x, p, cache1, cache2)
    cache1 .= @view(p[:, :, x[1]+1])
    for i = 2:length(x)
        mul!(cache2, @view(p[:, :, x[i]+1]), cache1)
        cache1 .= cache2
    end
    trace = 0.0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

function contraction2(x, p, cache1, cache2)
    cache1 .= @view(p[:, :, x[1]+1])
    for i = 2:length(x)
        mul!(cache2, @view(p[:, :, x[i]+1]), cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

function contraction3(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    trace = 0.0
    for i in 1:size(cache1, 1)
        trace += cache1[i, i]
    end
    return trace
end

function contraction4(x, p, cache1, cache2)
    cache1 .= p[x[1]+1]
    for i = 2:length(x)
        mul!(cache2, p[x[i]+1], cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

function contraction5(x, p1, p2, cache1, cache2)
    if x[1] == 0
        cache1 .= p1
    else
        cache1 .= p2
    end
    for i = 2:length(x)
        if x[i] == 0
            mul!(cache2, p1, cache1)
        else
            mul!(cache2, p2, cache1)
        end
        cache1 .= cache2
    end
    return tr(cache1)
end

L = 4
χ = 16
d = 2
x = rand(0:(d-1), L)
p = randn(χ, χ, d) .* 0.25
p2 = [p[:, :, i] for i = 1:d]
p31 = p2[1]
p32 = p2[2]
cache1 = zeros(χ, χ)
cache2 = zeros(χ, χ)

begin
    dfdp1 = zero(p)
    @btime autodiff(
        Reverse, contraction1, Const(x),
        Duplicated(p, dfdp1),
        Duplicated(cache1, zero(cache1)),
        Duplicated(cache2, zero(cache2))
    )
    # χ = 16: 128.400 μs (74 allocations: 13.72 KiB)
    # χ = 64: 8.471 ms (76 allocations: 73.56 KiB)
end
begin
    dfdp1 = zero(p)
    @btime autodiff(
        Reverse, contraction2, Const(x),
        Duplicated(p, dfdp1),
        Duplicated(cache1, zero(cache1)),
        Duplicated(cache2, zero(cache2))
    )
    # χ = 16: 128.000 μs (74 allocations: 13.72 KiB)
    # χ = 64: 8.478 ms (76 allocations: 73.56 KiB)
end
begin
    dfdp2 = zero.(p2)
    @btime autodiff(
        Reverse, contraction3, Const(x),
        Duplicated(p2, dfdp2),
        Duplicated(cache1, zero(cache1)),
        Duplicated(cache2, zero(cache2))
    )
    # χ = 16: 123.100 μs (75 allocations: 8.09 KiB)
    # χ = 64: 8.295 ms (77 allocations: 67.94 KiB)
end
begin
    dfdp2 = zero.(p2)
    @btime autodiff(
        Reverse, contraction4, Const(x),
        Duplicated(p2, dfdp2),
        Duplicated(cache1, zero(cache1)),
        Duplicated(cache2, zero(cache2))
    )
    # χ = 16: 123.000 μs (75 allocations: 8.09 KiB)
    # χ = 64: 8.288 ms (77 allocations: 67.94 KiB)
end
begin
    dfdp31 = zero(p31)
    dfdp32 = zero(p32)
    @btime autodiff(
        Reverse, contraction5, Const(x),
        Duplicated(p31, dfdp31),
        Duplicated(p32, dfdp32),
        Duplicated(cache1, zero(cache1)),
        Duplicated(cache2, zero(cache2))
    )
    # χ = 16: 123.400 μs (85 allocations: 8.52 KiB)
    # χ = 64: 8.289 ms (87 allocations: 68.36 KiB)
end

Wrapping the autodiff calls / cache duplication in a function definition reduces allocations but doesn't seem to change the timings.
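For reference, a sketch of what such a wrapper might look like (grad! and contraction are hypothetical names, not part of the thread's benchmark; the wrapper re-zeroes the caller-allocated shadows on every call, which also keeps repeated calls deterministic per the earlier discussion):

```julia
using Enzyme, LinearAlgebra

function contraction(x, p, cache1, cache2)
    cache1 .= @view(p[:, :, x[1]+1])
    for i = 2:length(x)
        mul!(cache2, @view(p[:, :, x[i]+1]), cache1)
        cache1 .= cache2
    end
    return tr(cache1)
end

# Hypothetical wrapper: the shadows are allocated once by the caller and
# re-zeroed here, so repeated gradient calls neither allocate fresh shadow
# arrays nor accumulate stale values.
function grad!(dfdp, x, p, c1, c2, dc1, dc2)
    fill!(dfdp, 0.0)
    fill!(dc1, 0.0)
    fill!(dc2, 0.0)
    autodiff(Reverse, contraction, Const(x),
        Duplicated(p, dfdp),
        Duplicated(c1, dc1),
        Duplicated(c2, dc2))
    return dfdp
end
```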

While for Zygote I'm doing

function apply(x, p)
    return tr(mapreduce(xs -> p[xs+1], *, reverse(x)))
end
function apply2(x, p)
    tmp = p[x[1]+1]
    for i = 2:length(x)
        tmp = p[x[i]+1] * tmp
    end
    return tr(tmp)
end

using Zygote
@btime Zygote.gradient(apply, $x, $p2)
# χ = 16: 27.800 μs (393 allocations: 38.53 KiB)
# χ = 64: 96.500 μs (383 allocations: 365.78 KiB)
@btime Zygote.gradient(apply2, $x, $p2)
# χ = 16: 15.900 μs (245 allocations: 33.75 KiB)
# χ = 64: 82.100 μs (234 allocations: 360.16 KiB)

with the same p2.

My target use case is actually something more complex that incurs a lot of overhead in Zygote (and seems quite difficult to write without mutation, or without JAX's mutation-like syntax), so it ends up being quite slow compared to an equivalent JAX implementation.

(Specifically, I'm aiming for a Julia implementation of the policy function here https://github.com/RL-with-TNs/acten_code/blob/aa00f626fe37397b724108caaa95c08e2f13ce5b/ACTeN/src/approximations/policy_mps_east.py#L7, which then needs to be differentiated with respect to the third argument.)

DomCRose avatar Oct 11 '22 15:10 DomCRose

As an aside @vchuravy, that is why I left this open: I don't think zero-initializing the temporaries should be required.

wsmoses avatar Oct 11 '22 17:10 wsmoses