Explore early finalization
Julia support early finalization insertion, https://github.com/JuliaLang/julia/pull/45272, however that does not trigger here because CuArrays' finalizers taints the TLS effect. Keno suggested just untainting that using @assume_effects, which we should explore.
If that doesn't work / In addition, @aviatesk suggested exploring integrating early finalization insertion with escape analysis, which may make this optimization even more potent.
Let's first start with the untainting and coming up with a couple of MWEs to look into.
cc @jpsamaroo
@vchuravy noted that this may also extend the lifetime of objects, up until the inserted finalizer whereas the GC could have collected it already if there were no outstanding references. Similar issues: https://github.com/JuliaLang/julia/issues/51818, https://github.com/JuliaLang/julia/issues/52533 (https://github.com/JuliaGPU/CUDA.jl/issues/2197). I'm not sure this is a blocker, but it's something to keep in mind, e.g., to make sure the finalizer is inserted as aggressively as possible:
a = CuArray(...)
if likely()
a = nothing
# finalizer should be inserted here
while something_long()
# ...
end
else
use(a)
end
# finalizer should not be inserted here,
# or `a` would be kept alive across the while loop
The above may not match how the finalizer insertion pass currently works; I haven't properly read into it yet.
I believe the biggest current blocker is that the finalizer inlining pass currently assumes that all operations on the target object are inlined. In other words, even simple code like the following cannot currently perform finalizer inlining:
@noinline function use(a)
... # uses a, but doesn't escape it to anywhere
end
let
Base.Experimental.@force_compile
a = CuArray(...)
use(a)
end
Here, using EA to analyze that use(x) does not escape x and enabling finalizer inlining would be the first step.
Specifically, could you come up with a concrete target code like the simple case above? I would like to use it to test EA ability and start optimizing for the simplest cases. Aggressive finalizer inlining in cases involving branches is also important, but let's start with the simple cases first.
A while back I came up with the following example for where currently the GC fails us. I didn't write it with early finalization in mind and it was more a test-bed for automatic reference counting, but the idea holds.
For me early finalization is too fragile and too dependent on inlining.
mutable struct ForeignBuffer{T}
const ptr::Ptr{T}
end
import Base: Libc
mutable struct HeapTracker
const lock::Base.Threads.SpinLock
const dict::Dict{Ptr{Cvoid}, Int}
@atomic size::Int
HeapTracker() = new(Base.Threads.SpinLock(), Dict{Ptr{Cvoid}, UInt}(), 0)
end
Base.lock(t::HeapTracker) = lock(t.lock)
Base.unlock(t::HeapTracker) = unlock(t.lock)
const TRACKER = HeapTracker()
function tracked_malloc(size)
local ptr
@lock TRACKER begin
@atomic TRACKER.size += size
ptr = Libc.malloc(size)
TRACKER.dict[ptr] = size
end
ptr
end
function tracked_free(ptr::Ptr)
ptr = Base.unsafe_convert(Ptr{Cvoid}, ptr)
@lock TRACKER begin
if !haskey(TRACKER.dict, ptr)
error("Double free")
end
size = pop!(TRACKER.dict, ptr)
@atomic TRACKER.size -= size
Libc.free(ptr)
end
end
function stats()
@info "Foreign heap size (bytes)" heap=TRACKER.size
end
function foreign_alloc(::Type{T}, length) where T
ptr = tracked_malloc(sizeof(T) * length)
ptr = Base.unsafe_convert(Ptr{T}, ptr)
obj = ForeignBuffer{T}(ptr)
finalizer(obj->tracked_free(obj.ptr), obj)
end
function main(N, iterations)
for _ in 1:iterations
workspace = foreign_alloc(Float64, N)
GC.@preserve workspace begin
ptr = workspace.obj
# ... use ptr
end
stats()
end
end
That is too complex, so I would prefer a simpler target if possible. CUDA might also be complex once lowered though.
This is pretty much the simplest thing. The only complexity here is the allocation tracking so that you can immediatly tell if you are successful. You can remove the tracker...
mutable struct ForeignBuffer{T}
const ptr::Ptr{T}
end
import Base: Libc
# unlikely to be inlined
function foreign_alloc(::Type{T}, length) where T
ptr = Libc.malloc(sizeof(T) * length)
ptr = Base.unsafe_convert(Ptr{T}, ptr)
obj = ForeignBuffer{T}(ptr)
finalizer(obj->Libc.free(obj.ptr), obj)
obj
end
function main(N, iterations)
for _ in 1:iterations
workspace = foreign_alloc(Float64, N)
GC.@preserve workspace begin
ptr = workspace.obj
# ... use ptr
end
end
end
Started some work at https://github.com/JuliaLang/julia/pull/55954.