change compiler to be stackless
This change ensures the compiler uses very little stack, so it runs reliably regardless of the system stack size or the depth of inference. It also opens the door to various forms of pausable/resumable inference, since there is no longer any implicit state on the stack--everything is local and explicit now.
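For intuition, here is a minimal sketch (not the compiler's actual code) of the kind of transformation involved: recursion is replaced by closures pushed onto an explicit work queue, so no "frame" ever lives on the machine stack across a call:

function stackless_f(n::Int)
    result = Ref{Int}()
    workq = Function[]
    # enqueue(n, k): arrange for f(n) to be computed and its result passed to k,
    # without ever recursing on the machine stack
    function enqueue(n, k)
        push!(workq, () -> n <= 0 ? k(0) : enqueue(n - 1, k))
    end
    enqueue(n, r -> (result[] = r))
    while !isempty(workq)
        pop!(workq)()   # drive pending frames; machine-stack depth stays O(1)
    end
    return result[]
end
stackless_f(50_000)     # finishes without hitting any stack limit

The real change is of course far more involved (each inference frame carries its state explicitly rather than in captured locals), but this is the general shape of the control flow.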
Whereas before, fewer than 900 frames would crash in less than a second:
$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(1000))'
Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.
Internal error: during type inference of
f(Base.Val{1000})
Encountered stack overflow.
This might be caused by recursion over very long tuples or argument lists.
[23763] signal 6: Abort trap: 6
in expression starting at none:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 1 (Pool: 1; Big: 0); GC: 0
Abort trap: 6
real 0m0.233s
user 0m0.165s
sys 0m0.049s
Now: it is effectively unlimited, as long as you are willing to wait for it:
$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(50000))'
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 2500 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 5000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 10000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 20000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 40000 frames (may be slow).
real 7m4.988s
$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(1000))'
real 0m0.214s
user 0m0.164s
sys 0m0.044s
$ time ./julia -e '@noinline f(::Val{N}) where {N} = N <= 0 ? GC.safepoint() : f(Val(N - 1)); f(Val(5000))'
info: inference of f(Base.Val{5000}) from f(Base.Val{N}) where {N} exceeding 2500 frames (may be slow).
info: inference of f(Base.Val{5000}) from f(Base.Val{N}) where {N} exceeding 5000 frames (may be slow).
real 0m8.609s
user 0m8.358s
sys 0m0.240s
@nanosoldier runbenchmarks("inference", vs=":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
I mostly agree with this approach. It would be very nice to make inference resumable. There are some performance concerns with creating a large number of closures, but we can probably address those with some effort.
One question I have is: what's the roadmap for extending this to parallel inference? It seems like there are some constraints on the order of scheduled inference tasks that need to be consumed, so it doesn’t seem like we can just parallelize it without careful consideration.
Also, we might need an alternative for the timing framework. It was never entirely accurate in the presence of recursive calls, so it might make sense to remove it altogether. However, I think there are still tools like SnoopCompile that rely on it. Maybe we should invent a new tool for that, which probably could be implemented externally?
@nanosoldier runbenchmarks("inference", vs=":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
> I mostly agree with this approach. It would be very nice to make inference resumable. There are some performance concerns with creating a large number of closures, but we can probably address those with some effort.
The benchmark results seem to indicate this is mostly a non-issue. One alternative might be to augment Future with something immutable like a NowOrLater, shown below, so that no allocation is required on the direct path (at the cost of potentially more memory, since it may allocate space to store the result twice):
# const Future{T} = RefValue{T} # if we don't need assertions, this could simply be a Ref, since UndefVarError will usually already track this state
struct NowOrLater{T}
    later::Union{Nothing, Future{T}}
    now::T
    NowOrLater{T}() where {T} = new{T}(Future{T}())        # result not known yet; `now` left #undef
    NowOrLater{T}(now::T) where {T} = new{T}(nothing, now) # result already known
end
NowOrLater(x) = NowOrLater{typeof(x)}(x)
Base.getproperty(v::NowOrLater, s::Symbol) = (later = getfield(v, :later); s === :completed ? later === nothing || later.completed : error("not a field"))
Base.getindex(v::NowOrLater) = (later = getfield(v, :later); later === nothing ? getfield(v, :now) : later[])
Base.setindex!(v::NowOrLater, x) = (later = getfield(v, :later); later === nothing ? error("constructed with a value") : later[] = x)
function NowOrLater{T}(f, v::NowOrLater, args...) where {T}
    later = getfield(v, :later)
    if later === nothing
        # value already available: apply f eagerly, no Future allocation
        return NowOrLater{T}(f(getfield(v, :now), args...)::T)
    else
        # value not ready yet: schedule f on the work queue of the enclosing
        # inference state (sv) and hand back a not-yet-completed NowOrLater
        let v = NowOrLater{T}()
            push!(sv.workq, (args...,) -> (v[] = f(later[], args...); nothing))
            return v
        end
    end
end
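For concreteness, a hypothetical usage of the two paths might look like the following (assuming Future{T} exposes a completed flag plus getindex/setindex!, as the definitions above expect):

eager = NowOrLater(2)        # result already known: stored inline, no Future allocated
eager.completed              # true
eager[]                      # 2

deferred = NowOrLater{Int}() # result not known yet: backed by a Future
deferred.completed           # false until the Future is completed
deferred[] = 42              # a queued continuation would eventually do this
deferred[]                   # 42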
> One question I have is: what's the roadmap for extending this to parallel inference? It seems like there are some constraints on the order of scheduled inference tasks that need to be consumed, so it doesn’t seem like we can just parallelize it without careful consideration.
I have no idea, actually. There is a lot more hidden state than that, which would need to be solved at the function level. Inside a function, it is nearly possible to resolve a call with an empty result and then continue from there (as if it were a cycle), but there are still quite a number of gotchas.
Comparing with the alternative (immutable) Future design:
@nanosoldier runbenchmarks("inference", vs=":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Ah right, exactly the same: we don't permit inlining this struct, since the field is #undef, and that becomes tricky to handle after inlining (or stack-allocating) it for return. We may first need to implement support for Union{Nothing, T} that always returns nothing if T is #undef (regardless of whether T got inlined).
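For reference, the problematic pattern is an immutable struct whose field may be left #undef by an incomplete inner constructor; a minimal, unrelated illustration:

struct MaybeLater
    x::String
    MaybeLater() = new()        # leaves x #undef
    MaybeLater(x) = new(x)
end
isdefined(MaybeLater(), :x)     # false: the field is #undef
isdefined(MaybeLater("hi"), :x) # true

As noted above, a struct like this is currently not inlined/stack-allocated for return, since there is no way to represent the missing field once it is unboxed.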
I think we can merge this soon, to avoid it getting any more stale. Once the union-pointer-abi work is finished (https://github.com/JuliaLang/julia/pull/55045), it should recover the allocation benchmark, and in the meantime we haven't seen much change in overall performance:
@nanosoldier runbenchmarks("inference", vs=":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.