CUDA.jl Explore early finalization

Julia supports early finalization insertion (https://github.com/JuliaLang/julia/pull/45272), but it does not trigger here because CuArray's finalizers taint the TLS effect. Keno suggested simply untainting that using @assume_effects, which we should explore.

If that doesn't work, or in addition to it, @aviatesk suggested integrating early finalization insertion with escape analysis, which may make this optimization even more potent.

Let's start with the untainting and come up with a couple of MWEs to look into.
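As a starting point for such an MWE, here is a minimal sketch of what untainting a finalizer could look like. It is not CUDA.jl code: `Buffer`, `release!` and `make_buffer` are hypothetical names, and the choice of `:nothrow` and `:notaskstate` is an assumption about which effects the finalizer inlining pass needs (`:notaskstate` being, as far as I can tell, what is called the TLS effect above). Asserting effects that do not actually hold is undefined behavior, so this would have to be justified for the real finalizer.

```julia
# Hypothetical stand-in for a GPU array: owns a malloc'ed buffer.
mutable struct Buffer
    ptr::Ptr{Cvoid}
end

# ASSUMPTION: freeing never throws and never touches task-local state.
# The annotation promises this to the compiler; it is not checked.
Base.@assume_effects :nothrow :notaskstate function release!(b::Buffer)
    Libc.free(b.ptr)
    b.ptr = C_NULL  # mark as released; free(C_NULL) is a no-op
    return nothing
end

# Allocate a buffer and attach the annotated finalizer.
# `finalizer` returns the object it was given.
function make_buffer(nbytes::Integer)
    b = Buffer(Libc.malloc(nbytes))
    return finalizer(release!, b)
end

b = make_buffer(16)
finalize(b)  # run the finalizer eagerly, as an inserted call would
```

`Base.infer_effects(release!, (Buffer,))` can then be used to inspect which effects the compiler ends up with for the annotated method.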

cc @jpsamaroo

Jul 12 '24 14:07 maleadt

@vchuravy noted that this may also extend the lifetime of objects: they would stay alive up to the inserted finalizer call, whereas the GC could already have collected them once there were no outstanding references. Similar issues: https://github.com/JuliaLang/julia/issues/51818, https://github.com/JuliaLang/julia/issues/52533 (https://github.com/JuliaGPU/CUDA.jl/issues/2197). I'm not sure this is a blocker, but it's something to keep in mind, e.g., to make sure the finalizer is inserted as aggressively as possible:

a = CuArray(...)

if likely()
    a = nothing
    # finalizer should be inserted here
    while something_long()
        # ...
    end
else
    use(a)
end

# finalizer should not be inserted here,
# or `a` would be kept alive across the while loop

The above may not match how the finalizer insertion pass currently works; I haven't properly read into it yet.

Jul 13 '24 08:07 maleadt

I believe the biggest blocker right now is that the finalizer inlining pass assumes that all operations on the target object are inlined. In other words, finalizer inlining cannot currently happen even for simple code like the following:

@noinline function use(a)
    ... # uses a, but doesn't escape it to anywhere
end

let 
    Base.Experimental.@force_compile
    a = CuArray(...)
    use(a)
end

Here, the first step would be to use EA to prove that use(a) does not escape a, and to enable finalizer inlining on that basis. Specifically, could you come up with concrete target code like the simple case above? I would like to use it to test EA's capabilities and start optimizing for the simplest cases. Aggressive finalizer inlining in cases involving branches is also important, but let's start with the simple cases first.

Jul 17 '24 17:07 aviatesk

A while back I came up with the following example of a case where the GC currently fails us. I didn't write it with early finalization in mind, and it was more of a test bed for automatic reference counting, but the idea holds.

For me early finalization is too fragile and too dependent on inlining.

mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end

import Base: Libc

mutable struct HeapTracker
    const lock::Base.Threads.SpinLock
    const dict::Dict{Ptr{Cvoid}, Int}
    @atomic size::Int
    HeapTracker() = new(Base.Threads.SpinLock(), Dict{Ptr{Cvoid}, Int}(), 0)
end

Base.lock(t::HeapTracker) = lock(t.lock)
Base.unlock(t::HeapTracker) = unlock(t.lock)
const TRACKER = HeapTracker()

function tracked_malloc(size)
    local ptr
    @lock TRACKER begin
        @atomic TRACKER.size += size
        ptr = Libc.malloc(size)
        TRACKER.dict[ptr] = size
    end
    ptr
end

function tracked_free(ptr::Ptr)
    ptr = Base.unsafe_convert(Ptr{Cvoid}, ptr)
    @lock TRACKER begin
        if !haskey(TRACKER.dict, ptr)
            error("Double free")
        end
        size = pop!(TRACKER.dict, ptr)
        @atomic TRACKER.size -= size
        Libc.free(ptr)
    end
end

function stats()
    # `size` is an @atomic field, so it must be read atomically
    @info "Foreign heap size (bytes)" heap=(@atomic TRACKER.size)
end

function foreign_alloc(::Type{T}, length) where T
    ptr = tracked_malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    finalizer(obj->tracked_free(obj.ptr), obj)
end

function main(N, iterations)
    for _ in 1:iterations
        workspace = foreign_alloc(Float64, N)
        GC.@preserve workspace begin
            ptr = workspace.ptr
            # ... use ptr
        end
        stats()
    end
end

Jul 17 '24 18:07 vchuravy

That is too complex, so I would prefer a simpler target if possible. CUDA might also be complex once lowered though.

Jul 17 '24 18:07 aviatesk

This is pretty much the simplest thing. The only complexity here is the allocation tracking, so that you can immediately tell whether you are successful. You can remove the tracker...

mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end

import Base: Libc

# unlikely to be inlined
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    finalizer(obj->Libc.free(obj.ptr), obj)
    obj
end

function main(N, iterations)
    for _ in 1:iterations
        workspace = foreign_alloc(Float64, N)
        GC.@preserve workspace begin
            ptr = workspace.ptr
            # ... use ptr
        end
    end
end
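For reference, the behavior that successful early finalization should be equivalent to can be written out by hand with an explicit `finalize` call at the end of each iteration. This is only a sketch of the intended result, not of how the compiler pass works; `FREED` and `main_manual` are hypothetical names added for illustration, and `unsafe_store!` stands in for the real use of the pointer.

```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end

# Hypothetical counter so that success is directly observable.
const FREED = Ref(0)

function foreign_alloc(::Type{T}, length) where T
    ptr = convert(Ptr{T}, Libc.malloc(sizeof(T) * length))
    obj = ForeignBuffer{T}(ptr)
    finalizer(obj) do o
        Libc.free(o.ptr)
        FREED[] += 1
    end
    return obj
end

function main_manual(N, iterations)
    for _ in 1:iterations
        workspace = foreign_alloc(Float64, N)
        GC.@preserve workspace begin
            ptr = workspace.ptr
            unsafe_store!(ptr, 1.0)  # stand-in for real work on the buffer
        end
        # What an inserted early finalizer would amount to: the buffer is
        # released here instead of whenever the GC next runs.
        finalize(workspace)
    end
    return FREED[]
end
```

With this version, every iteration frees its buffer before the next allocation, so the foreign heap never holds more than one workspace at a time.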

Jul 17 '24 19:07 vchuravy

Started some work at https://github.com/JuliaLang/julia/pull/55954.

Oct 01 '24 12:10 aviatesk