Extremely long compilation time on recursive, branching functions (DynamicExpressions.jl)
The old issue #1018 from August was getting a bit lengthy so I'm moving to a new issue – feel free to close that one.
This relates to my one-year effort to try to get Enzyme.jl working as one of the AD backends for DynamicExpressions.jl, SymbolicRegression.jl, and PySR. The current status is:
- Working first-order gradients, if I disable some of the optimizations
- Hanging first-order gradients (extremely long compilation time), if all the optimizations are left on
- Hanging second-order gradients (extremely long compilation time), regardless of optimization settings
I've boiled down the MWE to the following code, which replicates the issues I am seeing:
using Enzyme
################################################################################
### OperatorEnum.jl
################################################################################
struct OperatorEnum{B,U}
binops::B
unaops::U
end
################################################################################
################################################################################
### Equation.jl
################################################################################
mutable struct Node{T}
degree::UInt8 # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
constant::Bool # false if variable
val::Union{T,Nothing} # If is a constant, this stores the actual value
# ------------------- (possibly undefined below)
feature::UInt16 # If is a variable (e.g., x in cos(x)), this stores the feature index.
op::UInt8 # If operator, this is the index of the operator in operators.binops, or operators.unaops
l::Node{T} # Left child node. Only defined for degree=1 or degree=2.
r::Node{T} # Right child node. Only defined for degree=2.
Node(d::Integer, c::Bool, v::_T) where {_T} = new{_T}(UInt8(d), c, v)
Node(::Type{_T}, d::Integer, c::Bool, v::_T) where {_T} = new{_T}(UInt8(d), c, v)
Node(::Type{_T}, d::Integer, c::Bool, v::Nothing, f::Integer) where {_T} = new{_T}(UInt8(d), c, v, UInt16(f))
Node(d::Integer, c::Bool, v::Nothing, f::Integer, o::Integer, l::Node{_T}) where {_T} = new{_T}(UInt8(d), c, v, UInt16(f), UInt8(o), l)
Node(d::Integer, c::Bool, v::Nothing, f::Integer, o::Integer, l::Node{_T}, r::Node{_T}) where {_T} = new{_T}(UInt8(d), c, v, UInt16(f), UInt8(o), l, r)
end
function Node(::Type{T}; val::T1=nothing, feature::T2=nothing)::Node{T} where {T,T1,T2}
if T2 <: Nothing
!(T1 <: T) && (val = convert(T, val))
return Node(T, 0, true, val)
else
return Node(T, 0, false, nothing, feature)
end
end
Node(op::Integer, l::Node{T}) where {T} = Node(1, false, nothing, 0, op, l)
Node(op::Integer, l::Node{T}, r::Node{T}) where {T} = Node(2, false, nothing, 0, op, l, r)
################################################################################
################################################################################
### Utils.jl
################################################################################
@inline function fill_similar(value, array, args...)
out_array = similar(array, args...)
out_array .= value
return out_array
end
is_bad_array(array) = !(isempty(array) || isfinite(sum(array)))
function is_constant(tree::Node)
if tree.degree == 0
return tree.constant
elseif tree.degree == 1
return is_constant(tree.l)
else
return is_constant(tree.l) && is_constant(tree.r)
end
end
################################################################################
################################################################################
### EvaluateEquation.jl
################################################################################
struct ResultOk{A<:AbstractArray}
x::A
ok::Bool
end
function eval_tree_array(tree::Node{T}, cX::AbstractMatrix{T}, operators::OperatorEnum, fuse_level=Val(2)) where {T<:Number}
result = _eval_tree_array(tree, cX, operators, fuse_level)
return (result.x, result.ok && !is_bad_array(result.x))
end
counttuple(::Type{<:NTuple{N,Any}}) where {N} = N
get_nuna(::Type{<:OperatorEnum{B,U}}) where {B,U} = counttuple(U)
get_nbin(::Type{<:OperatorEnum{B}}) where {B} = counttuple(B)
@generated function _eval_tree_array(tree::Node{T}, cX::AbstractMatrix{T}, operators::OperatorEnum, ::Val{fuse_level})::ResultOk where {T<:Number,fuse_level}
nuna = get_nuna(operators)
nbin = get_nbin(operators)
quote
# First, we see if there are only constants in the tree - meaning
# we can just return the constant result.
if tree.degree == 0
return deg0_eval(tree, cX)
elseif is_constant(tree)
# Speed hack for constant trees.
const_result = _eval_constant_tree(tree, operators)::ResultOk{Vector{T}}
!const_result.ok && return ResultOk(similar(cX, axes(cX, 2)), false)
return ResultOk(fill_similar(const_result.x[], cX, axes(cX, 2)), true)
elseif tree.degree == 1
op_idx = tree.op
# This @nif lets us generate an if statement over choice of operator,
# which means the compiler will be able to completely avoid type inference on operators.
return Base.Cartesian.@nif(
$nuna,
i -> i == op_idx,
i -> let op = operators.unaops[i]
if fuse_level > 1 && tree.l.degree == 2 && tree.l.l.degree == 0 && tree.l.r.degree == 0
# op(op2(x, y)), where x and y are constants or variables.
l_op_idx = tree.l.op
Base.Cartesian.@nif(
$nbin,
j -> j == l_op_idx,
j -> let op_l = operators.binops[j]
deg1_l2_ll0_lr0_eval(tree, cX, op, op_l)
end,
)
elseif fuse_level > 1 && tree.l.degree == 1 && tree.l.l.degree == 0
# op(op2(x)), where x is a constant or variable.
l_op_idx = tree.l.op
Base.Cartesian.@nif(
$nuna,
j -> j == l_op_idx,
j -> let op_l = operators.unaops[j]
deg1_l1_ll0_eval(tree, cX, op, op_l)
end,
)
else
# op(x), for any x.
result = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
!result.ok && return result
deg1_eval(result.x, op)
end
end
)
else
op_idx = tree.op
return Base.Cartesian.@nif(
$nbin,
i -> i == op_idx,
i -> let op = operators.binops[i]
if fuse_level > 1 && tree.l.degree == 0 && tree.r.degree == 0
deg2_l0_r0_eval(tree, cX, op)
elseif tree.r.degree == 0
result_l = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
!result_l.ok && return result_l
# op(x, y), where y is a constant or variable but x is not.
deg2_r0_eval(tree, result_l.x, cX, op)
elseif tree.l.degree == 0
result_r = _eval_tree_array(tree.r, cX, operators, Val(fuse_level))
!result_r.ok && return result_r
# op(x, y), where x is a constant or variable but y is not.
deg2_l0_eval(tree, result_r.x, cX, op)
else
result_l = _eval_tree_array(tree.l, cX, operators, Val(fuse_level))
!result_l.ok && return result_l
result_r = _eval_tree_array(tree.r, cX, operators, Val(fuse_level))
!result_r.ok && return result_r
# op(x, y), for any x or y
deg2_eval(result_l.x, result_r.x, op)
end
end
)
end
end
end
function deg2_eval(
cumulator_l::AbstractVector{T}, cumulator_r::AbstractVector{T}, op::F
)::ResultOk where {T<:Number,F}
@inbounds @simd for j in eachindex(cumulator_l)
x = op(cumulator_l[j], cumulator_r[j])::T
cumulator_l[j] = x
end
return ResultOk(cumulator_l, true)
end
function deg1_eval(
cumulator::AbstractVector{T}, op::F
)::ResultOk where {T<:Number,F}
@inbounds @simd for j in eachindex(cumulator)
x = op(cumulator[j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
function deg0_eval(tree::Node{T}, cX::AbstractMatrix{T})::ResultOk where {T<:Number}
if tree.constant
return ResultOk(fill_similar(tree.val::T, cX, axes(cX, 2)), true)
else
return ResultOk(cX[tree.feature, :], true)
end
end
function deg1_l2_ll0_lr0_eval(
tree::Node{T}, cX::AbstractMatrix{T}, op::F, op_l::F2
) where {T<:Number,F,F2}
if tree.l.l.constant && tree.l.r.constant
val_ll = tree.l.l.val::T
val_lr = tree.l.r.val::T
x_l = op_l(val_ll, val_lr)::T
x = op(x_l)::T
return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
elseif tree.l.l.constant
val_ll = tree.l.l.val::T
feature_lr = tree.l.r.feature
cumulator = similar(cX, axes(cX, 2))
@inbounds @simd for j in axes(cX, 2)
x_l = op_l(val_ll, cX[feature_lr, j])::T
x = isfinite(x_l) ? op(x_l)::T : T(Inf)
cumulator[j] = x
end
return ResultOk(cumulator, true)
elseif tree.l.r.constant
feature_ll = tree.l.l.feature
val_lr = tree.l.r.val::T
cumulator = similar(cX, axes(cX, 2))
@inbounds @simd for j in axes(cX, 2)
x_l = op_l(cX[feature_ll, j], val_lr)::T
x = isfinite(x_l) ? op(x_l)::T : T(Inf)
cumulator[j] = x
end
return ResultOk(cumulator, true)
else
feature_ll = tree.l.l.feature
feature_lr = tree.l.r.feature
cumulator = similar(cX, axes(cX, 2))
@inbounds @simd for j in axes(cX, 2)
x_l = op_l(cX[feature_ll, j], cX[feature_lr, j])::T
x = isfinite(x_l) ? op(x_l)::T : T(Inf)
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
end
# op(op2(x)) for x variable or constant
function deg1_l1_ll0_eval(
tree::Node{T}, cX::AbstractMatrix{T}, op::F, op_l::F2
) where {T<:Number,F,F2}
if tree.l.l.constant
val_ll = tree.l.l.val::T
x_l = op_l(val_ll)::T
x = op(x_l)::T
return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
else
feature_ll = tree.l.l.feature
cumulator = similar(cX, axes(cX, 2))
@inbounds @simd for j in axes(cX, 2)
x_l = op_l(cX[feature_ll, j])::T
x = isfinite(x_l) ? op(x_l)::T : T(Inf)
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
end
# op(x, y) for x and y variable/constant
function deg2_l0_r0_eval(
tree::Node{T}, cX::AbstractMatrix{T}, op::F
) where {T<:Number,F}
if tree.l.constant && tree.r.constant
val_l = tree.l.val::T
val_r = tree.r.val::T
x = op(val_l, val_r)::T
return ResultOk(fill_similar(x, cX, axes(cX, 2)), true)
elseif tree.l.constant
cumulator = similar(cX, axes(cX, 2))
val_l = tree.l.val::T
feature_r = tree.r.feature
@inbounds @simd for j in axes(cX, 2)
x = op(val_l, cX[feature_r, j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
elseif tree.r.constant
cumulator = similar(cX, axes(cX, 2))
feature_l = tree.l.feature
val_r = tree.r.val::T
@inbounds @simd for j in axes(cX, 2)
x = op(cX[feature_l, j], val_r)::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
else
cumulator = similar(cX, axes(cX, 2))
feature_l = tree.l.feature
feature_r = tree.r.feature
@inbounds @simd for j in axes(cX, 2)
x = op(cX[feature_l, j], cX[feature_r, j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
end
# op(x, y) for x variable/constant, y arbitrary
function deg2_l0_eval(
tree::Node{T}, cumulator::AbstractVector{T}, cX::AbstractArray{T}, op::F
) where {T<:Number,F}
if tree.l.constant
val = tree.l.val::T
@inbounds @simd for j in eachindex(cumulator)
x = op(val, cumulator[j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
else
feature = tree.l.feature
@inbounds @simd for j in eachindex(cumulator)
x = op(cX[feature, j], cumulator[j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
end
# op(x, y) for x arbitrary, y variable/constant
function deg2_r0_eval(
tree::Node{T}, cumulator::AbstractVector{T}, cX::AbstractArray{T}, op::F
) where {T<:Number,F}
if tree.r.constant
val = tree.r.val::T
@inbounds @simd for j in eachindex(cumulator)
x = op(cumulator[j], val)::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
else
feature = tree.r.feature
@inbounds @simd for j in eachindex(cumulator)
x = op(cumulator[j], cX[feature, j])::T
cumulator[j] = x
end
return ResultOk(cumulator, true)
end
end
@generated function _eval_constant_tree(tree::Node{T}, operators::OperatorEnum) where {T<:Number}
nuna = get_nuna(operators)
nbin = get_nbin(operators)
quote
if tree.degree == 0
return deg0_eval_constant(tree)::ResultOk{Vector{T}}
elseif tree.degree == 1
op_idx = tree.op
return Base.Cartesian.@nif(
$nuna,
i -> i == op_idx,
i -> deg1_eval_constant(
tree, operators.unaops[i], operators
)::ResultOk{Vector{T}}
)
else
op_idx = tree.op
return Base.Cartesian.@nif(
$nbin,
i -> i == op_idx,
i -> deg2_eval_constant(
tree, operators.binops[i], operators
)::ResultOk{Vector{T}}
)
end
end
end
@inline function deg0_eval_constant(tree::Node{T}) where {T<:Number}
output = tree.val::T
return ResultOk([output], true)::ResultOk{Vector{T}}
end
function deg1_eval_constant(tree::Node{T}, op::F, operators::OperatorEnum) where {T<:Number,F}
result = _eval_constant_tree(tree.l, operators)
!result.ok && return result
output = op(result.x[])::T
return ResultOk([output], isfinite(output))::ResultOk{Vector{T}}
end
function deg2_eval_constant(tree::Node{T}, op::F, operators::OperatorEnum) where {T<:Number,F}
cumulator = _eval_constant_tree(tree.l, operators)
!cumulator.ok && return cumulator
result_r = _eval_constant_tree(tree.r, operators)
!result_r.ok && return result_r
output = op(cumulator.x[], result_r.x[])::T
return ResultOk([output], isfinite(output))::ResultOk{Vector{T}}
end
################################################################################
Now, we can see that the forward pass works okay:
# Operators to use:
operators = OperatorEnum((+, -, *, /), (cos, sin, exp, tanh))
# Variables:
x1, x2, x3 = (i -> Node(Float64; feature=i)).(1:3)
# Expression:
tree = Node(1, x1, Node(1, x2)) # == x1 + cos(x2)
# Input data
X = randn(3, 100);
# Output:
eval_tree_array(tree, X, operators, Val(2))
This evaluates x1 + cos(x2) over 100 random rows. Both Val(1) and Val(2) (fuse_level=1 and fuse_level=2, respectively) work here and produce the same output. You can ctrl-F for fuse_level to see which parts of the code it activates – it basically just turns on a couple of branches related to "fused" operators (e.g., sin(exp(x)) evaluated over the data inside a single loop).
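For example, a quick sanity check (just a sketch, reusing the setup above) that the two fuse levels agree:
# Sanity check (sketch): both fuse levels should give identical results.
out1, ok1 = eval_tree_array(tree, X, operators, Val(1))
out2, ok2 = eval_tree_array(tree, X, operators, Val(2))
@assert ok1 && ok2 && out1 ≈ out2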
Now, if I try compiling the reverse-mode gradient with respect to the input data for fuse-level 1:
f(tree, X, operators, output) = (output[] = sum(eval_tree_array(tree, X, operators, Val(1))[1]); nothing)
dX = Enzyme.make_zero(X)
output = [0.0]
doutput = [1.0]
autodiff(
Reverse,
f,
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
This takes about 1 minute to compile. But, once it's compiled, it's pretty fast.
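For what it's worth, the result matches the hand-derived gradient: since the output is sum(x1 .+ cos.(x2)), row 1 of dX should be all ones, row 2 should be -sin.(x2), and row 3 should stay zero. A quick check (sketch):
# Check dX against the hand-derived gradient of sum(x1 .+ cos.(x2)):
@assert dX[1, :] ≈ ones(size(X, 2))   # d/dx1 = 1
@assert dX[2, :] ≈ -sin.(X[2, :])     # d/dx2 = -sin(x2)
@assert all(iszero, dX[3, :])         # x3 is unused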
However, if I switch on some of the optimizations (fuse_level=2):
f(tree, X, operators, output) = (output[] = sum(eval_tree_array(tree, X, operators, Val(2))[1]); nothing)
output = [0.0]
doutput = [1.0]
autodiff(
Reverse,
f,
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
This seems to hang forever. I left it going for about a day, came back, and it was still running. I assume it would finish eventually, but it's obviously not a viable solution, as the existing AD backends using forward-mode autodiff compile in under a second. And if the user changes data types or operators, it will need to recompile again.
If I force it to quit with ctrl-\, I see various LLVM calls:
[68568] signal (3): Quit: 3
in expression starting at REPL[44]:1
__psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
__psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
__psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
__psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
__psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line)
unknown function (ip: 0x0)
_ZN4llvm22MustBeExecutedIterator7advanceEv at /Users/mcranmer/.julia/juliaup/julia-1.10.0-rc1+0.aarch64.apple.darwin14/lib/julia/libLLVM.dylib (unknown line)
unknown function (ip: 0x0)
Allocations: 37668296 (Pool: 37621782; Big: 46514); GC: 45
Any idea how to get this to scale better? It seems like some step of the compilation is hanging here and scaling exponentially with the number of branches.
Edit: Updated the MWE code to reduce it further.
Okay, I have verified the gradients are not broken – compilation is just extremely slow. For example, if I reduce the number of operators to just 4 in total:
operators = OperatorEnum((+, -), (cos, sin))
then compilation with Val(2) takes about 30 seconds. But every operator I add here seems to scale the compilation time exponentially.
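A rough sketch of how one could measure this scaling (not exactly what I ran – each OperatorEnum below is a distinct type, so every autodiff call compiles from scratch; expect each step to take dramatically longer, and the larger sets may effectively hang):
# Sketch: time the first (compiling) reverse-mode call for growing operator sets.
g(tree, X, operators, output) =
    (output[] = sum(eval_tree_array(tree, X, operators, Val(2))[1]); nothing)
for ops in (
    OperatorEnum((+,), (cos,)),
    OperatorEnum((+, -), (cos, sin)),
    OperatorEnum((+, -, *), (cos, sin, exp)),
)
    x1, x2 = Node(Float64; feature=1), Node(Float64; feature=2)
    tree = Node(1, x1, Node(1, x2))  # x1 + cos(x2) under each operator set
    X = randn(3, 100)
    dX = Enzyme.make_zero(X)
    output, doutput = [0.0], [1.0]
    @time autodiff(Reverse, g, Const(tree), Duplicated(X, dX),
        Const(ops), Duplicated(output, doutput))
end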
@wsmoses I remember you mentioned some cache lock may be hanging here: https://github.com/EnzymeAD/Enzyme.jl/issues/1018#issuecomment-1689434106. Could that be the cause of this?
cc @vchuravy
Yeah that last issue seemed to have some sort of deadlock (aka a lock state that would never release)
@MilesCranmer would you be able to simplify your hanging MWE?
It doesn't need to compute the same thing as long as it still hangs (e.g. sum -> first, etc). The plentiful macros, generated functions, etc. are making it difficult to diagnose.
Unfortunately, all of this stuff is required to reproduce it. The generated functions are necessary for Base.Cartesian.@nif to work, and that itself is necessary for Enzyme to work – otherwise there's a type instability in the operators used. You can see the statements that get turned on with fuse_level > 1; those are what give the very large compilation time.
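To illustrate the instability that @nif avoids (a hedged sketch with hypothetical helper names – apply_unstable and apply_stable are not from the package):
operators = OperatorEnum((+, -, *, /), (cos, sin, exp, tanh))
# Type-unstable: `op` is only inferred as a Union of the four operator types,
# so `op(x)` is not a single concrete call.
function apply_unstable(operators, op_idx::Integer, x)
    op = operators.unaops[op_idx]
    return op(x)
end
# Type-stable: @nif unrolls into `if op_idx == 1 ... elseif op_idx == 2 ...`,
# so each branch calls one concretely-typed operator (mirroring the MWE above).
@generated function apply_stable(operators::OperatorEnum, op_idx::Integer, x)
    nuna = get_nuna(operators)
    return quote
        Base.Cartesian.@nif($nuna, i -> i == op_idx, i -> operators.unaops[i](x))
    end
end
You can compare the two with @code_warntype apply_unstable(operators, 2, 1.0) versus @code_warntype apply_stable(operators, 2, 1.0).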
Can you look at https://github.com/JuliaGPU/GPUCompiler.jl/blob/21ca075c1e91fe0c15f1330ab487b4831013ec1f/src/jlgen.jl#L175 after the compilation?
@MilesCranmer in particular I am interested in how much it grows.
See also https://github.com/EnzymeAD/Enzyme.jl/issues/1182
Oh I see what you mean. Yeah let me check.
Okay, for both fuse levels the dictionary is the same length – just 3.
Here's the code I'm using to test (referencing the code in the original snippet above)
function gen_f(::Val{fuse_level}) where {fuse_level}
function (tree, X, operators, output)
output[] = sum(eval_tree_array(tree, X, operators, Val(fuse_level))[1])
return nothing
end
end
function run_test(fuse_level::Integer)
# Operators to use:
operators = OperatorEnum((+, -), (cos, sin))
# Variables:
x1, x2, x3 = (i -> Node(Float64; feature=i)).(1:3)
# Expression:
tree = Node(1, x1, Node(1, x2)) # == x1 + cos(x2)
# Input data
X = randn(3, 100)
# Output:
# eval_tree_array(tree, X, operators, Val(2))
output = [0.0]
doutput = [1.0]
dX = zeros(size(X))
autodiff(
Reverse,
gen_f(Val(fuse_level)),
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
dX
end
This number of operators is still pretty light. If I go to OperatorEnum((+, -, *, /), (cos, sin)) and fuse_level=2, it hits a stack overflow with no indication of where it occurred. So it seems like there is some recursive function that is working really, really hard on this kind of branching?
If I first do run_test(1), and then in the same session run_test(2), the cache has 5 entries.
If you are instead asking about the contents of the GLOBAL_CI_CACHES entries, then this is the result:
I first run run_test(1).
Then, GLOBAL_CI_CACHES is length 3.
Within that, the contents are:
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1006
k = CompilerConfig for GPUCompiler.NativeCompilerTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1011
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
I then run run_test(2). The cache has 5 elements in total. The contents are:
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1006
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1017
k = CompilerConfig for GPUCompiler.NativeCompilerTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 1025
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
k = CompilerConfig for Enzyme.Compiler.EnzymeTarget
length((Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k]).dict) = 0
If I then change the definition of run_test so that the OperatorEnum has an additional operator, and run run_test(1), I see the NativeCompilerTarget has 1039 entries. The EnzymeTarget for that run seems to have 1010 entries.
With that updated run_test, if I then do run_test(2), I see the NativeCompilerTarget grow to 1045 entries, and the latest EnzymeTarget has 1024 entries.
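(For reference, a sketch of the kind of snippet that produces these counts:)
# Sketch: walk the GPUCompiler code-instance caches and count entries per config.
for k in keys(Enzyme.GPUCompiler.GLOBAL_CI_CACHES)
    @show k
    @show length(Enzyme.GPUCompiler.GLOBAL_CI_CACHES[k].dict)
end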
@wsmoses I just managed to find a way to further reduce the code while still getting the long compilation behavior (updated first comment). It's a bit faster overall so I'm not sure if all the same behavior is there, but the exponential scaling from fuse_level=1 to fuse_level=2, or for more operators, seems to be the same.
(Note that the generated functions cannot be removed as they are required for the Base.Cartesian.@nif and that is required for Enzyme to know the operators at compile time.)
Hey both, just wanted to check in on this and see how things are going and whether you've had a chance to look into it. Thanks for all your work on this great package. Best, Miles
To temper expectations, I probably won't have time to look at things until end of January.
@MilesCranmer the latest release just added substantial compile-time improvements. However, I don't think they address the root cause (presumably nested GPUCompiler compilation?), so it'll be a constant-factor speedup – but still not nothing.
@vchuravy do you have time to diagnose this together next week?
Sorry, finally starting to investigate this more closely.
For the first function, which you said took about a minute to compile:
julia> @time autodiff(
Reverse,
f,
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
45.757168 seconds (27.99 M allocations: 1.761 GiB, 1.81% gc time, 100.00% compilation time)
((nothing, nothing, nothing, nothing),)
julia> @time autodiff(
Reverse,
f,
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
0.000065 seconds (161 allocations: 10.609 KiB)
((nothing, nothing, nothing, nothing),)
Time seems to persist on current main.
julia> @time autodiff(
Reverse,
f,
Const(tree),
Duplicated(X, dX),
Const(operators),
Duplicated(output, doutput)
)
^CERROR: InterruptException:
Stacktrace:
[1] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{…}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{…}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{…}, augmented::Ptr{…}, atomicAdd::Bool)
@ Enzyme.API ~/git/Enzyme.jl/src/api.jl:154
[2] enzyme!(job::GPUCompiler.CompilerJob{…}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::NTuple{…}, returnPrimal::Bool, expectedTapeType::Type, loweredArgs::Set{…}, boxedArgs::Set{…})
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:3177
[3] codegen(output::Symbol, job::GPUCompiler.CompilerJob{…}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5070
[4] codegen
@ ~/git/Enzyme.jl/src/compiler.jl:4477 [inlined]
[5] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5755
[6] _thunk
@ ~/git/Enzyme.jl/src/compiler.jl:5755 [inlined]
[7] cached_compilation
@ ~/git/Enzyme.jl/src/compiler.jl:5793 [inlined]
[8] (::Enzyme.Compiler.var"#554#555"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, NTuple{5, Bool}, Int64, Bool, Bool, UInt64, DataType})(ctx::LLVM.Context)
@ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5859
[9] JuliaContext(f::Enzyme.Compiler.var"#554#555"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, NTuple{5, Bool}, Int64, Bool, Bool, UInt64, DataType}; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/kqxyC/src/driver.jl:52
[10] JuliaContext(f::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/kqxyC/src/driver.jl:42
[11] #s2027#553
@ ~/git/Enzyme.jl/src/compiler.jl:5811 [inlined]
[12] var"#s2027#553"(FA::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, ReturnPrimal::Any, ShadowInit::Any, World::Any, ABI::Any, ::Any, ::Type, ::Type, ::Type, tt::Any, ::Type, ::Type, ::Type, ::Type, ::Type, ::Any)
@ Enzyme.Compiler ./none:0
[13] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
@ Core ./boot.jl:602
[14] autodiff
@ ~/git/Enzyme.jl/src/Enzyme.jl:286 [inlined]
[15] autodiff
@ ~/git/Enzyme.jl/src/Enzyme.jl:315 [inlined]
[16] autodiff(::ReverseMode{false, FFIABI, false}, ::typeof(f), ::Const{Node{Float64}}, ::Duplicated{Matrix{Float64}}, ::Const{OperatorEnum{Tuple{…}, Tuple{…}}}, ::Duplicated{Vector{Float64}})
@ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:300
[17] macro expansion
@ ./timing.jl:279 [inlined]
[18] top-level scope
@ ./REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
Looks like it's in the autodiff function itself rather than a deadlock (hopefully).
@vchuravy do you mind hitting it with a profile, like when we were perf debugging Sarah's code?
I have an attempt v2 of using Enzyme in SymbolicRegression.jl here: https://github.com/MilesCranmer/SymbolicRegression.jl/pull/326. This is the extension file itself: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/98d6329355eae8567304eed3e163eba300d1a7d8/ext/SymbolicRegressionEnzymeExt.jl.
I got some small-scale tests working (takes a while to compile), but when I do a full test and use it as the AD backend for symbolic regression, it just hangs.
Looking at btop, I can see that Julia isn't even spinning the CPUs.
It's weird because it seems to be worse than before even. Before, even though it took a very long time, it eventually compiled. But now it seems to just freeze. Any guesses what it's from? This is on Julia 1.10.4 on macOS. Let me know if there's any other info I can try to provide.
> It's weird because it seems to be worse than before even. Before, even though it took a very long time, it eventually compiled. But now it seems to just freeze. Any guesses what it's from? This is on Julia 1.10.4 on macOS. Let me know if there's any other info I can try to provide.
Attach lldb and get a stacktrace on all threads. Are you using multiple threads and autodiff_deferred?
> Attach lldb
Will do
> Are you using multiple threads
Yes (auto). I turned off multiple threads within the search – it should run serially (though Julia itself has access to multiple threads).
> and autodiff_deferred?
No
Ran in lldb, maybe this is it @vchuravy?
(lldb) frame info
frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
And the backtrace:
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
frame #1: 0x0000000105354638 libjulia-internal.1.10.4.dylib`_jl_mutex_lock(self=0x000000010bf65c30, lock=0x00000001056b1fc8) at threading.c:875:5 [opt]
frame #2: 0x0000000105af0f28 libjulia-codegen.1.10.4.dylib`::jl_generate_fptr_impl(jl_method_instance_t *, size_t, int *) [inlined] jl_mutex_lock(lock=<unavailable>) at julia_locks.h:65:5 [opt]
frame #3: 0x0000000105af0f18 libjulia-codegen.1.10.4.dylib`jl_generate_fptr_impl(mi=0x00000001777dfc10, world=31546, did_compile=0x000000015768e484) at jitlayers.cpp:483:5 [opt]
frame #4: 0x0000000105309e3c libjulia-internal.1.10.4.dylib`jl_compile_method_internal(mi=0x00000001777dfc10, world=31546) at gf.c:2481:16 [opt]
frame #5: 0x000000010530d104 libjulia-internal.1.10.4.dylib`ijl_apply_generic [inlined] _jl_invoke(F=0x0000000174ff0670, args=0x000000015768e5a8, nargs=2, mfunc=0x00000001777dfc10, world=31546) at gf.c:2887:16 [opt]
frame #6: 0x000000010530d0d0 libjulia-internal.1.10.4.dylib`ijl_apply_generic(F=0x0000000174ff0670, args=0x000000015768e5a8, nargs=<unavailable>) at gf.c:3077:12 [opt]
frame #7: 0x0000000171a174d0 X2eIS_auMC0.dylib`julia_value_gradientNOT.NOT._25951 at interface.jl:82
frame #8: 0x0000000171a68c7c X2eIS_auMC0.dylib`julia_initial_state_25915 at bfgs.jl:94
frame #9: 0x0000000171300de4
frame #10: 0x00000001713089cc
frame #11: 0x000000016751c208
frame #12: 0x00000001712c841c
frame #13: 0x000000010532c8d8 libjulia-internal.1.10.4.dylib`start_task [inlined] jl_apply(args=<unavailable>, nargs=1) at julia.h:1982:12 [opt]
frame #14: 0x000000010532c8cc libjulia-internal.1.10.4.dylib`start_task at task.c:1238:19 [opt]
When I continue and then ctrl-c:
(lldb) c
Process 27992 resuming
Process 27992 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000010535467c libjulia-internal.1.10.4.dylib`_jl_mutex_lock [inlined] _jl_mutex_wait(self=0x000000010bf65c30, lock=0x00000001056b1fc8, safepoint=1) at threading.c:847:17 [opt]
844 uv_mutex_unlock(&tls_lock);
845 }
846 jl_cpu_suspend();
-> 847 owner = jl_atomic_load_relaxed(&lock->owner);
848 }
849 }
850
When I run it with --threads=1:
julia --startup-file=no --project=. --threads=1 example.jl
ERROR: LoadError: TaskFailedException
nested task error: StackOverflowError:
Stacktrace:
[1] threading_run(fun::SymbolicRegression.SingleIterationModule.var"#296#threadsfor_fun#5"{SymbolicRegression.SingleIterationModule.var"#296#threadsfor_fun#4#6"{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, BitVector, Vector{Float64}, UnitRange{Int64}}}, static::Bool)
@ Base.Threads ./threadingconstructs.jl:172
[2] macro expansion
@ ./threadingconstructs.jl:220 [inlined]
[3] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/Utils.jl:155 [inlined]
[4] optimize_and_simplify_population(dataset::Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, pop::Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, curmaxsize::Int64, record::Dict{String, Any})
@ SymbolicRegression.SingleIterationModule ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SingleIteration.jl:111
[5] _dispatch_s_r_cycle(in_pop::Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, dataset::Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}; pop::Int64, out::Int64, iteration::Int64, verbosity::Int64, cur_maxsize::Int64, running_search_statistics::SymbolicRegression.AdaptiveParsimonyModule.RunningSearchStatistics)
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:1150
[6] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:841 [inlined]
[7] macro expansion
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SearchUtils.jl:118 [inlined]
[8] _warmup_search!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}, Tuple{Population{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, HallOfFame{Float32, Float32, Expression{Float32, Node{Float32}, @NamedTuple{operators::Nothing, variable_names::Nothing}}}, Dict{String, Any}, Float64}, Channel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:serial, 1, false}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}})
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:836
[9] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:serial, 1, false}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, saved_state::Nothing)
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:622
[10] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, Expression, @NamedTuple{}, false, false, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}, ADTypes.AutoEnzyme{Nothing}}, parallelism::Symbol, numprocs::Nothing, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Nothing, verbosity::Nothing, progress::Nothing, v_dim_out::Val{1})
@ SymbolicRegression ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:595
[11] equation_search
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:473 [inlined]
[12] #equation_search#26
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:436 [inlined]
[13] equation_search
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:382 [inlined]
[14] #equation_search#28
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/src/SymbolicRegression.jl:466 [inlined]
[15] top-level scope
@ ~/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
in expression starting at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
And when I remove that @threads call, I get:
> julia --startup-file=no --project=. --threads=1 example.jl
ERROR: LoadError: StackOverflowError:
in expression starting at /Users/mcranmer/PermaDocuments/SymbolicRegressionMonorepo/SymbolicRegression.jl/example.jl:14
Hahaha I found a hack to make it work. Increase the stack size 😆
with_stack(f, n) = fetch(schedule(Task(f, n)))
hall_of_fame = with_stack(2_000_000_000) do
equation_search(X, y; niterations=40, options=options, parallelism=:serial)
end
This actually works in the single-threaded version. Compilation is even surprisingly fast! (Maybe the multiple threads were interfering with each other?)
But... why is the stack exploding?
(Also... this crashes in multi-threaded mode, see below. Maybe spawned tasks don't inherit the parent task's stack size?)
fatal: error thrown and no exception handler available.
ErrorException("attempt to switch to exited task")
ijl_error at /Users/mcranmer/PermaDocuments/julia/src/rtutils.c:41
ijl_switch at /Users/mcranmer/PermaDocuments/julia/src/task.c:634
fatal: error thrown and no exception handler available.
ErrorException("attempt to switch to exited task")
try_yieldto at ./task.jl:921
ijl_error at /Users/mcranmer/PermaDocuments/julia/src/rtutils.c:41
ijl_switch at /Users/mcranmer/PermaDocuments/julia/src/task.c:634
wait at ./task.jl:995
try_yieldto at ./task.jl:921
task_done_hook at ./task.jl:675
wait at ./task.jl:995
jfptr_task_done_hook_75409 at /Users/mcranmer/PermaDocuments/julia/usr/lib/julia/sys.dylib (unknown line)
task_done_hook at ./task.jl:675
_jl_invoke at /Users/mcranmer/PermaDocuments/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mcranmer/PermaDocuments/julia/src/gf.c:3077
jfptr_task_done_hook_75409 at /Users/mcranmer/PermaDocuments/julia/usr/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/mcranmer/PermaDocuments/julia/src/./julia.h:1982 [inlined]
jl_finish_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:320
_jl_invoke at /Users/mcranmer/PermaDocuments/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/mcranmer/PermaDocuments/julia/src/gf.c:3077
start_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:1249
jl_apply at /Users/mcranmer/PermaDocuments/julia/src/./julia.h:1982 [inlined]
jl_finish_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:320
start_task at /Users/mcranmer/PermaDocuments/julia/src/task.c:1249
Quick updates:
- I can do multi-threaded searches with this too! I just wrap the Enzyme call itself:
with_stacksize(8 * 1024 * 1024) do
autodiff(
Reverse,
evaluator,
Duplicated(g.f.tree, g.extra.storage_tree),
Duplicated(g.f.dataset, g.extra.storage_dataset),
Const(g.f.options),
Const(g.f.idx),
Duplicated(output, doutput),
)
end
and this seems to fix things. It works with a full multi-threaded SymbolicRegression.jl search, being used by Optim.jl for optimizing constants – even with the new ParametricExpression that stores constants in multiple places.
- I have no idea if this Task API is safe or not, so I posted on Discourse here: https://discourse.julialang.org/t/raising-the-roof-increased-stack-size-shenanigans-in-enzyme/116511?u=milescranmer
Edit: The default Julia stack size is 4MB. I am requesting 8MB in the task.
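(For reference, with_stacksize here is assumed to be the same one-liner as the with_stack helper above, just renamed:)
# Assumed definition (not shown in this thread); same trick as `with_stack` above.
with_stacksize(f, n) = fetch(schedule(Task(f, n)))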
The stack size trick (/hack) is magic, Enzyme is rock solid now for me. So this is as good as closed. Thanks for all the help!!
P.S., It works so well that I wonder if you would consider putting a modified stack size directly into Enzyme? The default Julia stack size of 4 MB (or 2 MB on 32-bit systems) seems tiny for what Enzyme needs (I posted an issue here: https://github.com/JuliaLang/julia/issues/54998). I feel like Enzyme should even request 32 MB to be safe.
It also would give more reliable stack overflow errors if your compilation thread always starts with the same stack size.
I think part of the difficulty in debugging this was that the stack overflows were random:
- If the root task is the first to lock the compile cache, it might hit the stack overflow, since it carries over its stack from the underlying process.
- If a thread is the first to lock the compile cache, it might work, because that thread has a clean stack and more room!
I know that sometimes languages allocate threads with smaller stack sizes, so maybe that muddied things up as well (maybe a thread started doing the Enzyme compilation, and hit a stack overflow, and basically locked the compile cache forever)
P.P.S., I realised that it gets even trickier because of the caching. If one thread is compiling, hits a stack overflow, but has cached some of its work, could another thread then theoretically get deeper into the AD process (since it no longer needs to travel through those particular frames)?
Maybe I'm imagining things but I also feel like the larger stack size made compilation much faster... Maybe it was from some threads dying off from stack overflows, and the larger stack just let the first one run to completion..
I do think we need to actually check what is creating such deep stacks here.
Compilers are not supposed to be recursive, and if they are, we should transition to a worklist approach.
I honestly don’t know. I think even without recursion you could hit this stack overflow, see my experiments here: https://discourse.julialang.org/t/experiments-with-julia-stack-sizes-and-enzyme/116511/2?u=milescranmer
If the workload is big enough, and Enzyme gets called deep enough in the existing stack, I think it's entirely possible to hit an overflow with Julia's 4 MB stack size if the stack frames are hefty.
In my case, I was calling Optim deep inside SymbolicRegression, and that Optim optimization (which is also quite deep with all the line-search machinery) wraps an Enzyme-computed gradient of a loss function, which itself calls deep into DynamicExpressions (which uses recursive evaluation).
Even creating a single Task at the start of an Enzyme call is probably a good idea, since it will start a fresh stack. Wdyt?
For fun, can you set the environment variable ENABLE_GDBLISTENER=1 and get the backtrace of the stack overflow in gdb? That would hopefully let us see which Julia functions cause the relevant issues here.
I don't think gdb is available on macOS unfortunately. Is lldb okay?
Yeah this works for lldb as well.
You can also now use the Julia profiler during compilation (since it doesn't hang anymore) to get some idea of the stack traces involved.
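For example, a minimal sketch (assuming the run_test helper from earlier in the thread, called in a fresh session so the timing is dominated by compilation):
# Sketch: profile the first (compiling) call to see where the time goes.
using Profile
Profile.clear()
@profile run_test(2)
Profile.print(; C=true, mincount=100)  # C=true includes the LLVM/Enzyme C++ frames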