
change compiler to be stackless

Open · vtjnash opened this issue 1 year ago

This change ensures the compiler uses very little stack, making it work reliably at any system stack size and at much greater recursion depths. It could also now be extended fairly easily to support various forms of pausable/resumable inference, since there is no longer any implicit state on the stack: everything is local and explicit.
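
For intuition, here is a minimal sketch of the general technique, not this PR's actual implementation: recursion on the machine stack is replaced by an explicit work queue of continuations, so the native stack stays at a small, constant depth no matter how deep the logical recursion goes (drive!, f_stackless, and workq are illustrative names only):

# Minimal sketch of the stackless technique (not this PR's actual code):
# each "recursive" step pushes an explicit continuation onto a work queue,
# so the machine stack stays at a constant, small depth.
function drive!(workq::Vector{Function})
    while !isempty(workq)
        pop!(workq)() # run one unit of work; it may enqueue more
    end
end

# stackless analogue of f(n) = n <= 0 ? 0 : f(n - 1)
function f_stackless(n::Int)
    result = Ref(0)
    workq = Function[]
    step(k) = k <= 0 ? (result[] = 0) : push!(workq, () -> step(k - 1))
    push!(workq, () -> step(n))
    drive!(workq)
    return result[]
end

f_stackless(50_000) # never more than a few machine-stack frames deep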

Whereas before, fewer than 900 frames would crash it in under a second:

$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(1000))'
Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.
Internal error: during type inference of
f(Base.Val{1000})
Encountered stack overflow.
This might be caused by recursion over very long tuples or argument lists.

[23763] signal 6: Abort trap: 6
in expression starting at none:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 1 (Pool: 1; Big: 0); GC: 0
Abort trap: 6

real	0m0.233s
user	0m0.165s
sys	0m0.049s

Now the depth is effectively unlimited, as long as you are willing to wait for it:

$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(50000))'
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 2500 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 5000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 10000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 20000 frames (may be slow).
info: inference of f(Base.Val{50000}) from f(Base.Val{N}) where {N} exceeding 40000 frames (may be slow).
real	7m4.988s

$ time ./julia -e 'f(::Val{N}) where {N} = N <= 0 ? 0 : f(Val(N - 1)); f(Val(1000))'
real	0m0.214s
user	0m0.164s
sys	0m0.044s

$ time ./julia -e '@noinline f(::Val{N}) where {N} = N <= 0 ? GC.safepoint() : f(Val(N - 1)); f(Val(5000))'
info: inference of f(Base.Val{5000}) from f(Base.Val{N}) where {N} exceeding 2500 frames (may be slow).
info: inference of f(Base.Val{5000}) from f(Base.Val{N}) where {N} exceeding 5000 frames (may be slow).
real	0m8.609s
user	0m8.358s
sys	0m0.240s

@nanosoldier runbenchmarks("inference", vs=":master")

vtjnash · Aug 23 '24 19:08

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier · Aug 23 '24 20:08

I mostly agree with this approach. It would be very nice to make inference resumable. There are some performance concerns with creating a large number of closures, but we can probably address those with some effort.

One question I have: what's the roadmap for extending this to parallel inference? There seem to be constraints on the order in which scheduled inference tasks are consumed, so it doesn't seem like we can parallelize it without careful consideration.

aviatesk · Aug 30 '24 09:08

Also, we might need an alternative to the timing framework. It was never entirely accurate in the presence of recursive calls, so it might make sense to remove it altogether. However, I think there are still tools like SnoopCompile that rely on it. Maybe we should invent a new tool for that, which could probably be implemented externally?

aviatesk · Aug 30 '24 09:08

@nanosoldier runbenchmarks("inference", vs=":master")

aviatesk · Aug 30 '24 09:08

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier · Aug 30 '24 10:08

> I mostly agree with this approach. It would be very nice to make inference resumable. There are some performance concerns with creating a large number of closures, but we can probably address those with some effort.

The benchmark results seem to indicate this is mostly a non-issue. One alternative might be to augment Future with something immutable, like the NowOrLater below, so that potentially no allocation is required on the direct path (but potentially more memory is required, since it may allocate the space to store the result twice):

# const Future{T} = RefValue{T} # if we don't need assertions, then this can simply be a Ref, as UndefVarError will usually already track this state
struct NowOrLater{T}
    later::Union{Nothing, Future{T}}
    now::T
    NowOrLater{T}() where {T} = new{T}(Future{T}()) # leaves `now` #undef until the Future completes
    NowOrLater{T}(now::T) where {T} = new{T}(nothing, now)
end
NowOrLater(x) = NowOrLater{typeof(x)}(x)
Base.getproperty(v::NowOrLater, s::Symbol) = (later = getfield(v, :later); s === :completed ? later === nothing || later.completed : error("not a field"))
Base.getindex(v::NowOrLater) = (later = getfield(v, :later); later === nothing ? getfield(v, :now) : later[])
Base.setindex!(v::NowOrLater, x) = (later = getfield(v, :later); later === nothing ? error("constructed with a value") : later[] = x)
function NowOrLater{T}(f, v::NowOrLater, args...) where {T}
    later = getfield(v, :later)
    if later === nothing
        # the value is already available: apply f eagerly, with no Future allocation on this path
        return NowOrLater{T}(f(getfield(v, :now), args...)::T)
    else
        let v = NowOrLater{T}()
            # defer: schedule a continuation on the inference work queue (sv is the surrounding inference state)
            push!(sv.workq, () -> (v[] = f(later[], args...)::T; nothing))
            return v
        end
    end
end
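
For illustration, a hypothetical usage sketch (assumptions: the commented-out RefValue alias for Future is uncommented before the struct definition, and the deferred path is omitted here because sv.workq refers to the surrounding inference state):

const Future{T} = Base.RefValue{T} # must be defined before the struct above

eager = NowOrLater(42)      # value available now: no Future allocated
eager[]                     # 42, read straight from the now field

pending = NowOrLater{Int}() # value arrives later, through the Future
pending[] = 7               # normally a scheduled continuation would do this
pending[]                   # 7, read through the Future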

> One question I have: what's the roadmap for extending this to parallel inference? There seem to be constraints on the order in which scheduled inference tasks are consumed, so it doesn't seem like we can parallelize it without careful consideration.

I have no idea, actually. There is a lot more hidden state than that, which would need to be solved at the function level. Inside the function, it is nearly possible to resolve the call with an empty result, then continue from there (as if it were a cycle), but there are still quite a number of gotchas.

vtjnash · Sep 03 '24 23:09

Comparing with the alternative (immutable) Future design: @nanosoldier runbenchmarks("inference", vs=":master")

vtjnash · Sep 04 '24 15:09

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier · Sep 04 '24 16:09

Ah right, exactly the same: we don't permit inlining this struct, since the field may be #undef, and that becomes tricky to handle after inlining (or stack-allocating) it for return. We may first need to implement support for Union{Nothing, T} to always return nothing if the T field is #undef (regardless of whether T got inlined).
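
For context, a tiny standalone illustration of the kind of #undef field in question (an assumption-labeled sketch with a hypothetical MaybePending type; the inlining decision itself is not directly observable from user code like this):

struct MaybePending
    later::Union{Nothing, Base.RefValue{String}}
    now::String                                   # non-isbits, so it can be left #undef
    MaybePending() = new(Base.RefValue{String}()) # leaves now #undef
end
m = MaybePending()
isdefined(m, :now) # false; reading m.now would throw UndefRefError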

vtjnash · Sep 04 '24 17:09

I think we can merge this soon, to avoid it getting more stale; once the union-pointer-abi work is finished (https://github.com/JuliaLang/julia/pull/55045), it should recover the allocation benchmark, and meanwhile we haven't seen much change in overall performance: @nanosoldier runbenchmarks("inference", vs=":master")

vtjnash · Sep 26 '24 19:09

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier · Sep 26 '24 22:09